Re: solr 4.7.2 mergeFactor/ Merge policy issue

2015-03-16 Thread Dmitry Kan
Hi,

I can confirm similar behaviour, but for solr 4.3.1. We use default values
for merge related settings. Even though mergeFactor=10
by default, there are 13 segments in one core and 30 segments in another. I
am not sure it proves there is a bug in the merging, because it depends on
the TieredMergePolicy. Relevant discussion from the past:
http://lucene.472066.n3.nabble.com/TieredMergePolicy-reclaimDeletesWeight-td4071487.html
Apart from other policy parameters, you could play with ReclaimDeletesWeight
in case you'd like to influence how segments with deletes in them get merged.
See
http://stackoverflow.com/questions/18361300/informations-about-tieredmergepolicy
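
For orientation, here is a Lucene-level sketch of the knobs being discussed (setter
names from the 4.x TieredMergePolicy class; in Solr these are normally set on the
mergePolicy element in solrconfig.xml rather than in code, and the values below are
only illustrations):

import org.apache.lucene.index.TieredMergePolicy;

public class MergePolicySketch {
  public static void main(String[] args) {
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setMaxMergeAtOnce(10);         // how many segments get merged in one go
    tmp.setSegmentsPerTier(10.0);      // allowed segments per tier ("mergeFactor"-like)
    tmp.setReclaimDeletesWeight(3.0);  // higher than the default favors merging segments with deletes
    System.out.println(tmp);
  }
}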


Regarding your attachment: I believe it got cut by the mailing list system;
could you share it via a file sharing service?

On Sat, Mar 14, 2015 at 7:36 AM, Summer Shire  wrote:

> Hi All,
>
> Did anyone get a chance to look at my config and the InfoStream File ?
>
> I am very curious to see what you think
>
> thanks,
> Summer
>
> > On Mar 6, 2015, at 5:20 PM, Summer Shire  wrote:
> >
> > Hi All,
> >
> > Here’s more update on where I am at with this.
> > I enabled infoStream logging and quickly figured that I need to get rid
> of maxBufferedDocs. So Erick you
> > were absolutely right on that.
> > I increased my ramBufferSize to 100MB
> > and reduced maxMergeAtOnce to 3 and segmentsPerTier to 3 as well.
> > My config looks like this
> >
> > <indexConfig>
> >   <useCompoundFile>false</useCompoundFile>
> >   <ramBufferSizeMB>100</ramBufferSizeMB>
> >   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >     <int name="maxMergeAtOnce">3</int>
> >     <int name="segmentsPerTier">3</int>
> >   </mergePolicy>
> >   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
> >   <infoStream>true</infoStream>
> > </indexConfig>
> >
> > I am attaching a sample infostream log file.
> > In the infoStream logs, though, you can see how the segments keep adding up,
> > and it shows (just an example)
> > allowedSegmentCount=10 vs count=9 (eligible count=9) tooBigCount=0
> >
> > I looked at TieredMergePolicy.java to see how allowedSegmentCount is
> getting calculated
> > // Compute max allowed segs in the index
> >long levelSize = minSegmentBytes;
> >long bytesLeft = totIndexBytes;
> >double allowedSegCount = 0;
> >while(true) {
> >  final double segCountLevel = bytesLeft / (double) levelSize;
> >  if (segCountLevel < segsPerTier) {
> >allowedSegCount += Math.ceil(segCountLevel);
> >break;
> >  }
> >  allowedSegCount += segsPerTier;
> >  bytesLeft -= segsPerTier * levelSize;
> >  levelSize *= maxMergeAtOnce;
> >}
> >int allowedSegCountInt = (int) allowedSegCount;
> > and the minSegmentBytes is calculated as follows
> > // Compute total index bytes & print details about the index
> >long totIndexBytes = 0;
> >long minSegmentBytes = Long.MAX_VALUE;
> >for(SegmentInfoPerCommit info : infosSorted) {
> >  final long segBytes = size(info);
> >  if (verbose()) {
> >String extra = merging.contains(info) ? " [merging]" : "";
> >if (segBytes >= maxMergedSegmentBytes/2.0) {
> >  extra += " [skip: too large]";
> >} else if (segBytes < floorSegmentBytes) {
> >  extra += " [floored]";
> >}
> >message("  seg=" + writer.get().segString(info) + " size=" +
> String.format(Locale.ROOT, "%.3f", segBytes/1024/1024.) + " MB" + extra);
> >  }
> >
> >  minSegmentBytes = Math.min(segBytes, minSegmentBytes);
> >  // Accum total byte size
> >  totIndexBytes += segBytes;
> >}
> >
> >
> > any input is welcome.
> >
> > 
> >
> >
> > thanks,
> > Summer
> >
> >
> >> On Mar 5, 2015, at 8:11 AM, Erick Erickson 
> wrote:
> >>
> >> I would, BTW, either just get rid of the <maxBufferedDocs> altogether or
> >> make it much higher, i.e. 10. I don't think this is really your
> >> problem, but you're creating a lot of segments here.
> >>
> >> But I'm kind of at a loss as to what would be different about your
> setup.
> >> Is there _any_ chance that you have some secondary process looking at
> >> your index that's maintaining open searchers? Any custom code that's
> >> perhaps failing to close searchers? Is this a Unix or Windows system?
> >>
> >> And just to be really clear, you're _only_ seeing more segments being
> >> added, right? If you're only counting files in the index directory, it's
> >> _possible_ that merging is happening, you're just seeing new files take
> >> the place of old ones.
> >>
> >> Best,
> >> Erick
> >>
> >> On Wed, Mar 4, 2015 at 7:12 PM, Shawn Heisey 
> wrote:
> >>> On 3/4/2015 4:12 PM, Erick Erickson wrote:
>  I _think_, but don't know for sure, that the merging stuff doesn't get
>  triggered until you commit, it doesn't "just happen".
> 
>  Shot in the dark...
> >>>
> >>> I believe that new segments are created when the indexing buffer
> >>> (ramBufferSizeMB) fills up, even without commits.  I'm pretty sure that
> >>> anytime a new segment is created, the merge policy is checked to see
> >>> whether a merge is needed.
> >>>
> >>> Thanks,
> >>> Shawn
> >>>
> >
>
>


-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com

Re: [Poll]: User need for Solr security

2015-03-16 Thread Jan Høydahl
Hi,

We tend to recommend ManifoldCF for document level security since that is 
exactly what it is built for. So I doubt we'll see that as a built in feature 
in Solr.
However, the Solr integration is really not that advanced, and I also see 
customers implementing similar logic themselves with success.
On the document feeding side you need to add a few more fields to all your 
documents, typically include_acl and exclude_acl. Populate those fields
with data from LDAP about who (what groups) has access to that document and
who does not. If it is open information, index a special token "open" in the
include field.
Then, assuming your search client application has authenticated a user, you
would construct a filter with this user's groups, e.g.
  fq=include_acl:(groupA OR open)&fq=-exclude_acl:(groupA)
The filter would be constructed either in your application or in a Solr search 
component or query parser.
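
A minimal SolrJ sketch of that filter construction (SolrJ 4.x class names; the group
list here is a stand-in for whatever your LDAP lookup returns, and the core URL is
made up):

import java.util.Arrays;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class AclFilteredSearch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    List<String> groups = Arrays.asList("groupA", "groupB");   // resolved from LDAP, per Jan

    String joined = String.join(" OR ", groups);               // Java 8
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("include_acl:(" + joined + " OR open)");  // must match a group or "open"
    q.addFilterQuery("-exclude_acl:(" + joined + ")");         // and none of the excludes
    System.out.println(solr.query(q).getResults().getNumFound());
    solr.shutdown();
  }
}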

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 13. mar. 2015 kl. 01.48 skrev johnmu...@aol.com:
> 
> I would love to see record level (or even field level) restricted access in 
> Solr / Lucene.
> 
> This should be group level, LDAP like or some rule base (which can be 
> dynamic).  If the solution means having a second core, so be it.
> 
> The following is the closest I found: 
> https://wiki.apache.org/solr/SolrSecurity#Document_Level_Security but I 
> cannot use Manifold CF (Connector Framework).  Does anyone know how Manifold 
> does it?
> 
> - MJ
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: Thursday, March 12, 2015 6:51 PM
> To: solr-user@lucene.apache.org
> Subject: RE: [Poll]: User need for Solr security
> 
> Jan - we don't really need any security for our products, nor for most 
> clients. However, one client does deal with very sensitive data so we 
> proposed to encrypt the transfer of data and the data on disk through a 
> Lucene Directory. It won't fill all gaps but it would adhere to such a 
> client's guidelines. 
> 
> I think many approaches of security in Solr/Lucene would find advocates, be 
> it index encryption or authentication/authorization or transport security, 
> which is now possible. I understand the reluctance of the PMC, and I agree 
> with it, but some users would definitely benefit and it would certainly 
> make Solr/Lucene the search platform to use for some enterprises.
> 
> Markus 
> 
> -Original message-
>> From:Henrique O. Santos 
>> Sent: Thursday 12th March 2015 23:43
>> To: solr-user@lucene.apache.org
>> Subject: Re: [Poll]: User need for Solr security
>> 
>> Hi,
>> 
>> I’m currently working with indexes that need document level security. Based 
>> on the user logged in, query results would omit documents that this user 
>> doesn’t have access to, with LDAP integration and such.
>> 
>> I think that would be nice to have on a future Solr release.
>> 
>> Henrique.
>> 
>>> On Mar 12, 2015, at 7:32 AM, Jan Høydahl  wrote:
>>> 
>>> Hi,
>>> 
>>> Securing various Solr APIs has once again surfaced as a discussion 
>>> in the developer list. See e.g. SOLR-7236 Would be useful to get some 
>>> feedback from Solr users about needs "in the field".
>>> 
>>> Please reply to this email and let us know what security aspect(s) would be 
>>> most important for your company to see supported in a future version of 
>>> Solr.
>>> Examples: Local user management, AD/LDAP integration, SSL, 
>>> authenticated login to Admin UI, authorization for Admin APIs, e.g. 
>>> admin user vs read-only user etc
>>> 
>>> --
>>> Jan Høydahl, search solution architect Cominvent AS - 
>>> www.cominvent.com
>>> 
>> 
>> 
> 



indexing db records via SolrJ

2015-03-16 Thread sreedevi s
Hi,

I am a beginner in Solr. I have a scenario where I need to index data from
my MySQL db and need to query it. I have figured out how to provide my db
data import configs using DIH. I also know how to query my index via SolrJ.

How can I do indexing via a SolrJ client for my db, other than reading
the db records into documents one by one?

The point of this question is whether there is any way I can make use of my
configuration files to achieve the same. We need to use Java APIs, so all
indexing and querying can be done only via SolrJ.
Best Regards,
Sreedevi S


Re: indexing db records via SolrJ

2015-03-16 Thread Mikhail Khludnev
Hello,

Did you see the great post http://lucidworks.com/blog/indexing-with-solrj/
?

On Mon, Mar 16, 2015 at 1:30 PM, sreedevi s 
wrote:

> Hi,
>
> I am a beginner in Solr. I have a scenario, where I need to index data from
> my MySQL db and need to query them. I have figured out to provide my db
> data import configs using DIH. I also know to query my index via SolrJ.
>
> How can I do indexing via SorJ client for my db as well other than reading
> the db records into documents one by one?
>
> This question is in point whether is there any way I can make use of my
> configuration files and achieve the same. We need to use java APIs, so all
> indexing and querying can be done only via SolrJ.
> Best Regards,
> Sreedevi S
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Solr returns incorrect results after sorting

2015-03-16 Thread kumarraj
Hi,

I am using group.sort to internally sort the values first based on
store (using a function), then stock, and finally distance, and to sort the
output results based on price, but Solr does not return the correct results
after sorting.
Below is the sample query:

q=*:*&start=0&rows=200&sort=pricecommon_double
desc&d=321&spatial=true&sfield=store_location&fl=geodist(),*&pt=37.1037311,-76.5104751&

group.ngroups=true&group.limit=1&group.facet=true&group.field=code_string&group=true&group.sort=max(if(exists(query({!v='storeName_string:212'})),2,0),if(exists(query({!v='storeName_string:203'})),1,0))
desc,inStock_boolean desc&geodist() asc


I am expecting all the docs to be sorted by price from high to low after
grouping, but I see the records not matching that order. Do you see any
issues with the query, or are functions in group.sort not supported in
Solr?




Regards,
Raj



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-returns-incorrect-results-after-sorting-tp4193266.html
Sent from the Solr - User mailing list archive at Nabble.com.


[ANNOUNCE] Luke 4.10.4 released

2015-03-16 Thread Dmitry Kan
Hello,

Luke 4.10.4 has been released. Download it here:

https://github.com/DmitryKey/luke/releases/tag/luke-4.10.4

The release has been tested against a solr-4.10.4 based index.

Changes:
Trivial pom upgrade to lucene 4.10.4.
Got rid of the index version warning on the index summary tab.
Luke is now distributed as a tar.gz with the luke binary and a launcher
script.


A version of Luke built atop Apache Pivot is currently cooking in its own
branch. You can already try it out for some basic index loading and search
operations:

https://github.com/DmitryKey/luke/tree/pivot-luke

-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: [Poll]: User need for Solr security

2015-03-16 Thread Ahmet Arslan
Hi John,

The ManifoldCF in Action book is publicly available to anyone:
https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/

For Solr integration, please see:
https://svn.apache.org/repos/asf/manifoldcf/integration/solr-5.x/trunk/README.txt

Ahmet

On Friday, March 13, 2015 2:50 AM, "johnmu...@aol.com"  
wrote:



I would love to see record level (or even field level) restricted access in 
Solr / Lucene.

This should be group level, LDAP like or some rule base (which can be dynamic). 
 If the solution means having a second core, so be it.

The following is the closest I found: 
https://wiki.apache.org/solr/SolrSecurity#Document_Level_Security but I cannot 
use Manifold CF (Connector Framework).  Does anyone know how Manifold does it?

- MJ


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Thursday, March 12, 2015 6:51 PM
To: solr-user@lucene.apache.org
Subject: RE: [Poll]: User need for Solr security

Jan - we don't really need any security for our products, nor for most clients. 
However, one client does deal with very sensitive data so we proposed to 
encrypt the transfer of data and the data on disk through a Lucene Directory. 
It won't fill all gaps but it would adhere to such a client's guidelines. 

I think many approaches of security in Solr/Lucene would find advocates, be it 
index encryption or authentication/authorization or transport security, which 
is now possible. I understand the reluctance of the PMC, and I agree with it, 
but some users would definitely benefit and it would certainly make 
Solr/Lucene the search platform to use for some enterprises.

Markus 

-Original message-
> From:Henrique O. Santos 
> Sent: Thursday 12th March 2015 23:43
> To: solr-user@lucene.apache.org
> Subject: Re: [Poll]: User need for Solr security
> 
> Hi,
> 
> I’m currently working with indexes that need document level security. Based 
> on the user logged in, query results would omit documents that this user 
> doesn’t have access to, with LDAP integration and such.
> 
> I think that would be nice to have on a future Solr release.
> 
> Henrique.
> 
> > On Mar 12, 2015, at 7:32 AM, Jan Høydahl  wrote:
> > 
> > Hi,
> > 
> > Securing various Solr APIs has once again surfaced as a discussion 
> > in the developer list. See e.g. SOLR-7236 Would be useful to get some 
> > feedback from Solr users about needs "in the field".
> > 
> > Please reply to this email and let us know what security aspect(s) would be 
> > most important for your company to see supported in a future version of 
> > Solr.
> > Examples: Local user management, AD/LDAP integration, SSL, 
> > authenticated login to Admin UI, authorization for Admin APIs, e.g. 
> > admin user vs read-only user etc
> > 
> > --
> > Jan Høydahl, search solution architect Cominvent AS - 
> > www.cominvent.com
> > 
> 
>


Re: indexing db records via SolrJ

2015-03-16 Thread sreedevi s
Hi,
I had checked this post. I don't know whether this is possible, but my query
is whether I can use the DIH configuration for indexing via SolrJ.

Best Regards,
Sreedevi S

On Mon, Mar 16, 2015 at 4:17 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Hello,
>
> Did you see the great post http://lucidworks.com/blog/indexing-with-solrj/
> ?
>
> On Mon, Mar 16, 2015 at 1:30 PM, sreedevi s 
> wrote:
>
> > Hi,
> >
> > I am a beginner in Solr. I have a scenario, where I need to index data
> from
> > my MySQL db and need to query them. I have figured out to provide my db
> > data import configs using DIH. I also know to query my index via SolrJ.
> >
> > How can I do indexing via SorJ client for my db as well other than
> reading
> > the db records into documents one by one?
> >
> > This question is in point whether is there any way I can make use of my
> > configuration files and achieve the same. We need to use java APIs, so
> all
> > indexing and querying can be done only via SolrJ.
> > Best Regards,
> > Sreedevi S
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
>


Re: Solr returns incorrect results after sorting

2015-03-16 Thread david.w.smi...@gmail.com
I noticed you have an ‘&’ immediately preceding the "geodist() asc" at the
very end of the query/URL; that’s supposed to be a comma since group.sort
is a comma delimited list of sorts.
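
For reference, a hedged SolrJ sketch of the same request with the group.sort clauses
joined by commas instead of '&' (field and store names are taken from Raj's query; the
core URL is made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class GroupSortExample {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("*:*");
    q.set("rows", 200);
    q.set("sort", "pricecommon_double desc");
    q.set("sfield", "store_location");
    q.set("pt", "37.1037311,-76.5104751");
    q.set("d", "321");
    q.set("group", true);
    q.set("group.field", "code_string");
    q.set("group.limit", 1);
    q.set("group.ngroups", true);
    // group.sort is a single comma-delimited list: function desc, stock desc, distance asc
    q.set("group.sort",
        "max(if(exists(query({!v='storeName_string:212'})),2,0),"
      + "if(exists(query({!v='storeName_string:203'})),1,0)) desc,"
      + "inStock_boolean desc,geodist() asc");
    System.out.println(solr.query(q).getGroupResponse());
    solr.shutdown();
  }
}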

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Mon, Mar 16, 2015 at 7:51 AM, kumarraj  wrote:

> Hi,
>
> I am using group.sort to internally sort the values first based on
> store(using function),then stock and finally distance and sort the output
> results based on price, but solr does not return the correct results after
> sorting.
> Below is the  sample query:
>
> q=*:*&start=0&rows=200&sort=pricecommon_double
>
> desc&d=321&spatial=true&sfield=store_location&fl=geodist(),*&pt=37.1037311,-76.5104751&
>
>
> group.ngroups=true&group.limit=1&group.facet=true&group.field=code_string&group=true&group.sort=max(if(exists(query({!v='storeName_string:212'})),2,0),if(exists(query({!v='storeName_string:203'})),1,0))
> desc,inStock_boolean desc&geodist() asc
>
>
> I am expecting all the docs to be sorted by price from high to low after
> grouping,  but i see the records not matching the order, Do you see any
> issues with the query or having functions in group.sort is not supported in
> solr?
>
>
>
>
> Regards,
> Raj
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-returns-incorrect-results-after-sorting-tp4193266.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: indexing db records via SolrJ

2015-03-16 Thread Shawn Heisey
On 3/16/2015 7:15 AM, sreedevi s wrote:
> I had checked this post.I dont know whether this is possible but my query
> is whether I can use the configuration for DIH for indexing via SolrJ

You can use SolrJ for accessing DIH.  I have code that does this, but
only for full index rebuilds.

It won't be particularly obvious how to do it.  Writing code that can
interpret DIH status and know when it finishes, succeeds, or fails is
very tricky because DIH only uses human-readable status info, not
machine-readable, and the info is not very consistent.

I can't just share my code, because it's extremely convoluted ... but
the general gist is to create a SolrQuery object, use setRequestHandler
to set the handler to "/dataimport" or whatever your DIH handler is, and
set the other parameters on the request like "command" to "full-import"
and so on.
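
As a rough illustration of that gist (not Shawn's actual code), a minimal SolrJ sketch
that triggers a full import and then naively polls the human-readable status; class
names are SolrJ 4.x (HttpSolrClient replaces HttpSolrServer in 5.x) and the core URL is
made up:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.util.NamedList;

public class DihTrigger {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    // kick off a full import through the DIH request handler
    SolrQuery start = new SolrQuery();
    start.setRequestHandler("/dataimport");
    start.set("command", "full-import");
    start.set("clean", "true");
    solr.query(start);

    // poll the status until DIH reports it is idle again
    SolrQuery status = new SolrQuery();
    status.setRequestHandler("/dataimport");
    status.set("command", "status");
    String state;
    do {
      Thread.sleep(5000);
      NamedList<Object> rsp = solr.query(status).getResponse();
      state = (String) rsp.get("status");      // e.g. "busy" or "idle"
    } while (!"idle".equals(state));
    solr.shutdown();
  }
}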

Thanks,
Shawn



Re: indexing db records via SolrJ

2015-03-16 Thread Hal Roberts
We import anywhere from five to fifty million small documents a day from 
a postgres database.  I wrestled to get the DIH stuff to work for us for 
about a year and was much happier when I ditched that approach and 
switched to writing the few hundred lines of relatively simple code to 
handle directly the logic of what gets updated and how it gets queried 
from postgres ourselves.
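
As a rough illustration of that approach (not Hal's actual code), a minimal
JDBC-to-SolrJ loop; the table, columns, JDBC URL and core URL are all hypothetical:

import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class JdbcIndexer {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    try (Connection con = DriverManager.getConnection(
             "jdbc:postgresql://localhost/mydb", "user", "pass");
         Statement st = con.createStatement();
         ResultSet rs = st.executeQuery("SELECT id, title, body FROM articles")) {
      List<SolrInputDocument> batch = new ArrayList<>();
      while (rs.next()) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", rs.getString("id"));
        doc.addField("title", rs.getString("title"));
        doc.addField("body", rs.getString("body"));
        batch.add(doc);
        if (batch.size() == 1000) { solr.add(batch); batch.clear(); }  // send in batches
      }
      if (!batch.isEmpty()) solr.add(batch);
      solr.commit();
    }
    solr.shutdown();
  }
}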


The DIH stuff is great for lots of cases, but if you are getting to the 
point of trying to hack its undocumented internals, I suspect you are 
better off spending a day or two of your time just writing all of the 
update logic yourself.


We found a relatively simple combination of postgres triggers, export to 
csv based on those triggers, and then just calling update/csv to work 
best for us.
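
And a hedged SolrJ sketch of that last step, pushing an exported CSV file at the
/update/csv handler (the file path and core URL are made up):

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvUpload {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
    req.addFile(new File("/tmp/articles.csv"), "text/csv");   // the trigger-exported CSV
    req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    solr.request(req);
    solr.shutdown();
  }
}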


-hal

On 3/16/15 9:59 AM, Shawn Heisey wrote:

On 3/16/2015 7:15 AM, sreedevi s wrote:

I had checked this post.I dont know whether this is possible but my query
is whether I can use the configuration for DIH for indexing via SolrJ


You can use SolrJ for accessing DIH.  I have code that does this, but
only for full index rebuilds.

It won't be particularly obvious how to do it.  Writing code that can
intepret DIH status and know when it finishes, succeeds, or fails is
very tricky because DIH only uses human-readable status info, not
machine-readable, and the info is not very consistent.

I can't just share my code, because it's extremely convoluted ... but
the general gist is to create a SolrQuery object, use setRequestHandler
to set the handler to "/dataimport" or whatever your DIH handler is, and
set the other parameters on the request like "command" to "full-import"
and so on.

Thanks,
Shawn



--
Hal Roberts
Fellow
Berkman Center for Internet & Society
Harvard University


Solr Deleted Docs Issue

2015-03-16 Thread vicky desai
Hi,

I am having an issue with my solr setup. In my solr config I have set the
following property:
*<mergeFactor>10</mergeFactor>*

Now consider the following situation. I have *200* documents in my index and I
need to update all 200 docs.
The total commit operations I hit are *20*, i.e. I update in batches of 10 docs;
merging is done after every 10th update, so the max segment count I can
have is 10, which is fine. However, even when merging happens, deleted docs are
not cleared and I end up with 100 deleted docs in the index.

If this operation is done continuously, I would end up with a large set of
deleted docs, which will affect the performance of the queries I hit on this
Solr.

Can anyone please tell me whether I have missed a config or whether this is
expected behaviour?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Deleted-Docs-Issue-tp4193292.html
Sent from the Solr - User mailing list archive at Nabble.com.


Relevancy : Keyword stuffing

2015-03-16 Thread Mihran Shahinian
Hi all,
I have a use case where the data is generated by SEO-minded authors, and
more often than not
they perfectly guess the synonym expansions for the document titles, skewing
results in their favor.
At the moment I don't have an offline processing infrastructure to detect
these (I can't punish these docs either... just have to level the playing
field).
I am experimenting with taking the max of the term scores, cutting off
scores after a certain number of terms, etc., but would appreciate any hints
if anyone has experience dealing with a similar use case in Solr.

Much appreciated,
Mihran


thresholdTokenFrequency changes suggestion frequency..

2015-03-16 Thread Nitin Solanki
Hi,
  I am not getting why the suggestion frequency varies from the
original frequency.
Example - I have a word = *who* and its original frequency is *100*, but
when I get suggestions for it, the suggestion frequency changes to "*50*".

I think this is happening because of *thresholdTokenFrequency*.
When I set the value of thresholdTokenFrequency to *0.1* it gives one
frequency for the 'who' suggestion, while setting the value of
thresholdTokenFrequency to *0.0001* gives a different frequency. Why so?
I am not getting the logic behind this.

As I understand it, the suggestion frequency should be the same as the original
frequency in the index -

*The spellcheck.extendedResults=true parameter provides the frequency of each
original term in the index (origFreq) as well as the frequency of each
suggestion in the index (frequency).*


Re: Solr Deleted Docs Issue

2015-03-16 Thread Shawn Heisey
On 3/16/2015 9:11 AM, vicky desai wrote:
> I am having an issue with my solr setup. In my solr config I have set
> following property
> *<mergeFactor>10</mergeFactor>*

The mergeFactor setting is deprecated ... but you are setting it to the
default value of 10 anyway, so that's not really a big deal.  It's
possible that mergeFactor will no longer work in 5.0, but I'm not sure
on that.  You should instead use the settings specific to the merge
policy, which normally is TieredMergePolicy.

Note that when mergeFactor is 10, you *will* end up with more than 10
segments in your index.  There are multiple merge tiers, each one can
have up to 10 segments before it is merged.

> Now consider following situation. I have* 200* documents in my index. I need
> to update all the 200 docs
> If total commit operations I hit are* 20* i.e I update batches of 10 docs
> merging is done after every 10th update and so the max Segment Count I can
> have is 10 which is fine. However even when merging happens deleted docs are
> not cleared and I end up with 100 deleted docs in index. 
>
> If this operation is continuously done I would end up with a large set of
> deleted docs which will affect the performance of the queries I hit on this
> solr.

Because there are multiple merge tiers and you cannot easily
pre-determine which segments will be chosen for a particular merge, the
merge behavior may not be exactly what you expect.

The only guaranteed way to get rid of your deleted docs is to do an
optimize operation, which forces a merge of the entire index down to a
single segment.  This gets rid of all deleted docs in those segments. 
If you index more data while you are doing the optimize, then you may
end up with additional deleted docs.
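
A one-line SolrJ version of that operation, for reference (expensive on large
indexes, so treat it as an occasional maintenance step; the core URL is made up):

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class OptimizeIndex {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    solr.optimize();   // forces a merge down to one segment, expunging deleted docs
    solr.shutdown();
  }
}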

Thanks,
Shawn



RE: Relevancy : Keyword stuffing

2015-03-16 Thread Markus Jelsma
Hello - setting (e)dismax' tie breaker to 0 or much low than default would 
`solve` this for now.
Markus 
 
-Original message-
> From:Mihran Shahinian 
> Sent: Monday 16th March 2015 16:29
> To: solr-user@lucene.apache.org
> Subject: Relevancy : Keyword stuffing
> 
> Hi all,
> I have a use case where the data is generated by SEO minded authors and
> more often than not
> they perfectly guess the synonym expansions for the document titles skewing
> results in their favor.
> At the moment I don't have an offline processing infrastructure to detect
> these (I can't punish these docs either... just have to level the playing
> field).
> I am experimenting with taking the max of the term scores, cutting off
> scores after certain number of terms,etc but would appreciate any hints if
> anyone has experience dealing with a similar use case in solr.
> 
> Much appreciated,
> Mihran
> 


Re: Whole RAM consumed while Indexing.

2015-03-16 Thread Erick Erickson
First start by lengthening your soft and hard commit intervals
substantially. Start with 60000 (60 seconds) and work backwards, I'd say.

Ramkumar has tuned the heck out of his installation to get the commit
intervals to be that short ;).

I'm betting that you'll see your RAM usage go way down, but that' s a
guess until you test.

Best,
Erick

On Sun, Mar 15, 2015 at 10:56 PM, Nitin Solanki  wrote:
> Hi Erick,
> You are right. **"overlapping searchers"
> warning messages** are coming in the logs.
> **numDocs numbers** are changing as documents are added at the time of
> indexing.
> Any help?
>
> On Sat, Mar 14, 2015 at 11:24 PM, Erick Erickson 
> wrote:
>
>> First, the soft commit interval is very short. Very, very, very, very
>> short. 300ms is
>> just short of insane unless it's a typo ;).
>>
>> Here's a long background:
>>
>> https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>
>> But the short form is that you're opening searchers every 300 ms. The
>> hard commit is better,
>> but every 3 seconds is still far too short IMO. I'd start with soft
>> commits of 60000 and hard
>> commits of 60000 (60 seconds), meaning that you're going to have to
>> wait 1 minute for
>> docs to show up unless you explicitly commit.
>>
>> You're throwing away all the caches configured in solrconfig.xml more
>> than 3 times a second,
>> executing autowarming, etc, etc, etc
>>
>> Changing these to longer intervals might cure the problem, but if not
>> then, as Hoss would
>> say, "details matter". I suspect you're also seeing "overlapping
>> searchers" warning messages
>> in your log, and it;s _possible_ that what's happening is that you're
>> just exceeding the
>> max warming searchers and never opening a new searcher with the
>> newly-indexed documents.
>> But that's a total shot in the dark.
>>
>> How are you looking for docs (and not finding them)? Does the numDocs
>> number in
>> the solr admin screen change?
>>
>>
>> Best,
>> Erick
>>
>> On Thu, Mar 12, 2015 at 10:27 PM, Nitin Solanki 
>> wrote:
>> > Hi Alexandre,
>> >
>> >
>> > *Hard Commit* is :
>> >
>> >  <autoCommit>
>> >    <maxTime>${solr.autoCommit.maxTime:3000}</maxTime>
>> >    <openSearcher>false</openSearcher>
>> >  </autoCommit>
>> >
>> > *Soft Commit* is :
>> >
>> > <autoSoftCommit>
>> >   <maxTime>${solr.autoSoftCommit.maxTime:300}</maxTime>
>> > </autoSoftCommit>
>> >
>> > And I am committing 2 documents each time.
>> > Is it a good config for committing?
>> > Or am I doing something wrong?
>> >
>> >
>> > On Fri, Mar 13, 2015 at 8:52 AM, Alexandre Rafalovitch <
>> arafa...@gmail.com>
>> > wrote:
>> >
>> >> What's your commit strategy? Explicit commits? Soft commits/hard
>> >> commits (in solrconfig.xml)?
>> >>
>> >> Regards,
>> >>Alex.
>> >> 
>> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> >> http://www.solr-start.com/
>> >>
>> >>
>> >> On 12 March 2015 at 23:19, Nitin Solanki  wrote:
>> >> > Hello,
>> >> >   I have written a python script to do 2 documents
>> indexing
>> >> > each time on Solr. I have 28 GB RAM with 8 CPU.
>> >> > When I started indexing, at that time 15 GB RAM was freed. While
>> >> indexing,
>> >> > all RAM is consumed but **not** a single document is indexed. Why so?
>> >> > And it through *HTTPError: HTTP Error 503: Service Unavailable* in
>> python
>> >> > script.
>> >> > I think it is due to heavy load on Zookeeper by which all nodes went
>> >> down.
>> >> > I am not sure about that. Any help please..
>> >> > Or anything else is happening..
>> >> > And how to overcome this issue.
>> >> > Please assist me towards right path.
>> >> > Thanks..
>> >> >
>> >> > Warm Regards,
>> >> > Nitin Solanki
>> >>
>>


Re: Solr Deleted Docs Issue

2015-03-16 Thread Erick Erickson
bq: If this operation is continuously done I would end up with a large set of
deleted docs which will affect the performance of the queries I hit on this
solr.

No, you won't. They'll be "merged away" as background segments are merged.
Here's a great visualization of the process; the third one down is the
default TieredMergePolicy.

In general, even in the case of replacing all the docs, you'll have 10% of your
corpus be deleted docs. The % of deleted docs in a segment weighs quite
heavily when it comes to the decision of which segment to merge (note that
merging purges the deleted docs).

Also in general, the results of small tests like this simply do not generalize.
i.e. the number of deleted docs in a 200 doc sample size can't be
extrapolated to a reasonable-sized corpus.

Finally, I don't know if this is something temporary, but the implication of
"If total commit operations I hit are 20" is that you're committing after every
batch of docs is sent to Solr. You should not do this, let your autocommit
settings handle this.

Here's Mike's blog:
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Best,
Erick

On Mon, Mar 16, 2015 at 8:51 AM, Shawn Heisey  wrote:
> On 3/16/2015 9:11 AM, vicky desai wrote:
>> I am having an issue with my solr setup. In my solr config I have set
>> following property
>> *<mergeFactor>10</mergeFactor>*
>
> The mergeFactor setting is deprecated ... but you are setting it to the
> default value of 10 anyway, so that's not really a big deal.  It's
> possible that mergeFactor will no longer work in 5.0, but I'm not sure
> on that.  You should instead use the settings specific to the merge
> policy, which normally is TieredMergePolicy.
>
> Note that when mergeFactor is 10, you *will* end up with more than 10
> segments in your index.  There are multiple merge tiers, each one can
> have up to 10 segments before it is merged.
>
>> Now consider following situation. I have* 200* documents in my index. I need
>> to update all the 200 docs
>> If total commit operations I hit are* 20* i.e I update batches of 10 docs
>> merging is done after every 10th update and so the max Segment Count I can
>> have is 10 which is fine. However even when merging happens deleted docs are
>> not cleared and I end up with 100 deleted docs in index.
>>
>> If this operation is continuously done I would end up with a large set of
>> deleted docs which will affect the performance of the queries I hit on this
>> solr.
>
> Because there are multiple merge tiers and you cannot easily
> pre-determine which segments will be chosen for a particular merge, the
> merge behavior may not be exactly what you expect.
>
> The only guaranteed way to get rid of your deleted docs is to do an
> optimize operation, which forces a merge of the entire index down to a
> single segment.  This gets rid of all deleted docs in those segments.
> If you index more data while you are doing the optimize, then you may
> end up with additional deleted docs.
>
> Thanks,
> Shawn
>


Re: Solr tlog and soft commit

2015-03-16 Thread vidit.asthana
Can someone please reply to these questions? 

Thanks in advance.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-tlog-and-soft-commit-tp4193105p4193311.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: indexing db records via SolrJ

2015-03-16 Thread mike st. john
Take a look at some of the integrations people are using with Apache Storm;
we do something similar on a larger scale, having created a pgsql spout
and a Solr indexing bolt.


-msj

On Mon, Mar 16, 2015 at 11:08 AM, Hal Roberts <
hrobe...@cyber.law.harvard.edu> wrote:

> We import anywhere from five to fifty million small documents a day from a
> postgres database.  I wrestled to get the DIH stuff to work for us for
> about a year and was much happier when I ditched that approach and switched
> to writing the few hundred lines of relatively simple code to handle
> directly the logic of what gets updated and how it gets queried from
> postgres ourselves.
>
> The DIH stuff is great for lots of cases, but if you are getting to the
> point of trying to hack its undocumented internals, I suspect you are
> better off spending a day or two of your time just writing all of the
> update logic yourself.
>
> We found a relatively simple combination of postgres triggers, export to
> csv based on those triggers, and then just calling update/csv to work best
> for us.
>
> -hal
>
>
> On 3/16/15 9:59 AM, Shawn Heisey wrote:
>
>> On 3/16/2015 7:15 AM, sreedevi s wrote:
>>
>>> I had checked this post.I dont know whether this is possible but my query
>>> is whether I can use the configuration for DIH for indexing via SolrJ
>>>
>>
>> You can use SolrJ for accessing DIH.  I have code that does this, but
>> only for full index rebuilds.
>>
>> It won't be particularly obvious how to do it.  Writing code that can
>> intepret DIH status and know when it finishes, succeeds, or fails is
>> very tricky because DIH only uses human-readable status info, not
>> machine-readable, and the info is not very consistent.
>>
>> I can't just share my code, because it's extremely convoluted ... but
>> the general gist is to create a SolrQuery object, use setRequestHandler
>> to set the handler to "/dataimport" or whatever your DIH handler is, and
>> set the other parameters on the request like "command" to "full-import"
>> and so on.
>>
>> Thanks,
>> Shawn
>>
>>
> --
> Hal Roberts
> Fellow
> Berkman Center for Internet & Society
> Harvard University
>


Re: Relevancy : Keyword stuffing

2015-03-16 Thread Chris Hostetter

You should start by checking out the "SweetSpotSimilarity" .. it was 
heavily designed around the idea of dealing with things like excessively 
verbose titles, and keyword stuffing in summary text ... so you can 
configure your expectation for what a "normal" length doc is, and docs 
will be penalized for being longer than that.  Similarly, you can say what 
a 'reasonable' tf is, and docs that exceed that wouldn't get added boost 
(which in conjunction with the lengthNorm penalty penalizes docs that 
stuff keywords)

https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html

https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg
https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg
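
A rough Lucene-level sketch of those dials (in Solr you would normally configure
SweetSpotSimilarityFactory in the schema instead; the method names are from the
lucene-misc SweetSpotSimilarity class as I recall them, and the numbers are purely
illustrative):

import org.apache.lucene.misc.SweetSpotSimilarity;

public class SweetSpotSketch {
  public static void main(String[] args) {
    SweetSpotSimilarity sim = new SweetSpotSimilarity();
    // fields between ~200 and ~1000 terms count as "normal"; longer docs
    // pick up an increasing lengthNorm penalty
    sim.setLengthNormFactors(200, 1000, 0.5f, true);
    // tf contribution saturates: beyond roughly the xoffset-th occurrence,
    // repeating a keyword stops adding score
    sim.setHyperbolicTfFactors(1.0f, 2.0f, Math.E, 10.0f);
    System.out.println(sim);
  }
}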


-Hoss
http://www.lucidworks.com/


Data Import Handler - reading GET

2015-03-16 Thread Kiran J
Hi,

In data import handler, I can read the "clean" query parameter using
${dih.request.clean} and pass it on to the queries. Is it possible to read
any query parameter from the URL? e.g. ${foo}?

Thanks


Re: indexing db records via SolrJ

2015-03-16 Thread Jean-Sebastien Vachon
Do you have any references to such integrations (Solr + Storm)?

Thanks


From: mike st. john 
Sent: Monday, March 16, 2015 2:39 PM
To: solr-user@lucene.apache.org
Subject: Re: indexing db records via SolrJ

Take a look at some of the integrations people are using with apache storm,
  we do something similar on a larger scale , having created a pgsql spout
and having a solr indexing bolt.


-msj

On Mon, Mar 16, 2015 at 11:08 AM, Hal Roberts <
hrobe...@cyber.law.harvard.edu> wrote:

> We import anywhere from five to fifty million small documents a day from a
> postgres database.  I wrestled to get the DIH stuff to work for us for
> about a year and was much happier when I ditched that approach and switched
> to writing the few hundred lines of relatively simple code to handle
> directly the logic of what gets updated and how it gets queried from
> postgres ourselves.
>
> The DIH stuff is great for lots of cases, but if you are getting to the
> point of trying to hack its undocumented internals, I suspect you are
> better off spending a day or two of your time just writing all of the
> update logic yourself.
>
> We found a relatively simple combination of postgres triggers, export to
> csv based on those triggers, and then just calling update/csv to work best
> for us.
>
> -hal
>
>
> On 3/16/15 9:59 AM, Shawn Heisey wrote:
>
>> On 3/16/2015 7:15 AM, sreedevi s wrote:
>>
>>> I had checked this post.I dont know whether this is possible but my query
>>> is whether I can use the configuration for DIH for indexing via SolrJ
>>>
>>
>> You can use SolrJ for accessing DIH.  I have code that does this, but
>> only for full index rebuilds.
>>
>> It won't be particularly obvious how to do it.  Writing code that can
>> intepret DIH status and know when it finishes, succeeds, or fails is
>> very tricky because DIH only uses human-readable status info, not
>> machine-readable, and the info is not very consistent.
>>
>> I can't just share my code, because it's extremely convoluted ... but
>> the general gist is to create a SolrQuery object, use setRequestHandler
>> to set the handler to "/dataimport" or whatever your DIH handler is, and
>> set the other parameters on the request like "command" to "full-import"
>> and so on.
>>
>> Thanks,
>> Shawn
>>
>>
> --
> Hal Roberts
> Fellow
> Berkman Center for Internet & Society
> Harvard University
>


Re: Data Import Handler - reading GET

2015-03-16 Thread Alexandre Rafalovitch
Have you tried? As ${dih.request.foo}?
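
Untested, but a SolrJ sketch of what that would look like, with "foo" standing in
for any custom parameter (core URL made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DihWithParam {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery req = new SolrQuery();
    req.setRequestHandler("/dataimport");
    req.set("command", "full-import");
    req.set("foo", "bar");              // would show up in data-config as ${dih.request.foo}
    solr.query(req);
    solr.shutdown();
  }
}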

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 16 March 2015 at 14:51, Kiran J  wrote:
> Hi,
>
> In data import handler, I can read the "clean" query parameter using
> ${dih.request.clean} and pass it on to the queries. Is it possible to read
> any query parameter from the URL ? for eg ${foo} ?
>
> Thanks


Nginx proxy for Solritas

2015-03-16 Thread LongY
Dear Community Members,

I have searched over the forum and googled a lot, but still didn't find a
solution, which finally got me here for help.

I am implementing an Nginx reverse proxy for Solritas
(VelocityResponseWriter) of the example included in Solr.
Nginx listens on port 80, and Solr runs on port 8983. This is my Nginx
configuration file (it only permits localhost
to access the browse request handler).

*location ~* /solr/\w+/browse {
   proxy_pass  http://localhost:8983;

allow   127.0.0.1;
denyall;

proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header Host $http_host;
  
}*

when I input http://localhost/solr/collection1/browse in the browser address
bar. 
 The output I got is this. 
 
The supposed output should be like this 
 

I tested the Admin page with this Nginx configuration file (with some minor
modifications) and it worked well,
but when used with the velocity templates, it did not render the output properly.
 
Any input is welcome.
Thank you.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nginx-proxy-for-Solritas-tp4193346.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Nginx proxy for Solritas

2015-03-16 Thread Erik Hatcher
The links to the screenshots aren’t working for me.  I’m not sure what the 
issue is - but do be aware that /browse with its out of the box templates does 
refer to resources (CSS, images, JavaScript) that aren’t under /browse, so 
you’ll need to allow those to be accessible as well with different rules.


—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com 




> On Mar 16, 2015, at 3:39 PM, LongY  wrote:
> 
> Dear Community Members,
> 
> I have searched over the forum and googled a lot, still didn't find the
> solution. Finally got me here for help.
> 
> I am implementing a Nginx reverse proxy for Solritas
> (VelocityResponseWriter) of the example included in Solr.
> . Nginx listens on port 80, and solr runs on port 8983. This is my Nginx
> configuration file (It only permits localhost
> to access the browse request handler).
> 
> *location ~* /solr/\w+/browse {
>   proxy_pass  http://localhost:8983;
> 
>allow   127.0.0.1;
>denyall;
> 
>proxy_set_header X-Real-IP $remote_addr;
>proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
>proxy_set_header Host $http_host;
> 
>}*
> 
> when I input http://localhost/solr/collection1/browse in the browser address
> bar. 
> The output I got is this. 
>  
> The supposed output should be like this 
>  
> 
> I tested the Admin page with this Nginx configuration file with some minor
> modifications, it worked well,
> but when used in velocity templates, it did not render the output properly.
> 
> Any input is welcome.
> Thank you.
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Nginx-proxy-for-Solritas-tp4193346.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Nginx proxy for Solritas

2015-03-16 Thread LongY
Thank you for the reply.

I also thought the relevant resources (CSS, images, JavaScript) need to
be accessible through Nginx.

I copied the velocity folder to the solr-webapp/webapp folder. It didn't work.

So how do I make the /browse resources accessible through the Nginx rule?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nginx-proxy-for-Solritas-tp4193347p4193352.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Whole RAM consumed while Indexing.

2015-03-16 Thread Ramkumar R. Aiyengar
Yes, and doing so is painful and takes lots of people and hardware
resources to get there for large amounts of data and queries :)

As Erick says, work backwards from 60s and first establish how high the
commit interval can be to satisfy your use case..
On 16 Mar 2015 16:04, "Erick Erickson"  wrote:

> First start by lengthening your soft and hard commit intervals
> substantially. Start with 6 and work backwards I'd say.
>
> Ramkumar has tuned the heck out of his installation to get the commit
> intervals to be that short ;).
>
> I'm betting that you'll see your RAM usage go way down, but that' s a
> guess until you test.
>
> Best,
> Erick
>
> On Sun, Mar 15, 2015 at 10:56 PM, Nitin Solanki 
> wrote:
> > Hi Erick,
> > You are saying correct. Something, **"overlapping searchers"
> > warning messages** are coming in logs.
> > **numDocs numbers** are changing when documents are adding at the time of
> > indexing.
> > Any help?
> >
> > On Sat, Mar 14, 2015 at 11:24 PM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> >> First, the soft commit interval is very short. Very, very, very, very
> >> short. 300ms is
> >> just short of insane unless it's a typo ;).
> >>
> >> Here's a long background:
> >>
> >>
> https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >>
> >> But the short form is that you're opening searchers every 300 ms. The
> >> hard commit is better,
> >> but every 3 seconds is still far too short IMO. I'd start with soft
> >> commits of 6 and hard
> >> commits of 6 (60 seconds), meaning that you're going to have to
> >> wait 1 minute for
> >> docs to show up unless you explicitly commit.
> >>
> >> You're throwing away all the caches configured in solrconfig.xml more
> >> than 3 times a second,
> >> executing autowarming, etc, etc, etc
> >>
> >> Changing these to longer intervals might cure the problem, but if not
> >> then, as Hoss would
> >> say, "details matter". I suspect you're also seeing "overlapping
> >> searchers" warning messages
> >> in your log, and it;s _possible_ that what's happening is that you're
> >> just exceeding the
> >> max warming searchers and never opening a new searcher with the
> >> newly-indexed documents.
> >> But that's a total shot in the dark.
> >>
> >> How are you looking for docs (and not finding them)? Does the numDocs
> >> number in
> >> the solr admin screen change?
> >>
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Mar 12, 2015 at 10:27 PM, Nitin Solanki 
> >> wrote:
> >> > Hi Alexandre,
> >> >
> >> >
> >> > *Hard Commit* is :
> >> >
> >> >  
> >> >${solr.autoCommit.maxTime:3000}
> >> >false
> >> >  
> >> >
> >> > *Soft Commit* is :
> >> >
> >> > 
> >> > ${solr.autoSoftCommit.maxTime:300}
> >> > 
> >> >
> >> > And I am committing 2 documents each time.
> >> > Is it good config for committing?
> >> > Or I am good something wrong ?
> >> >
> >> >
> >> > On Fri, Mar 13, 2015 at 8:52 AM, Alexandre Rafalovitch <
> >> arafa...@gmail.com>
> >> > wrote:
> >> >
> >> >> What's your commit strategy? Explicit commits? Soft commits/hard
> >> >> commits (in solrconfig.xml)?
> >> >>
> >> >> Regards,
> >> >>Alex.
> >> >> 
> >> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> >> >> http://www.solr-start.com/
> >> >>
> >> >>
> >> >> On 12 March 2015 at 23:19, Nitin Solanki 
> wrote:
> >> >> > Hello,
> >> >> >   I have written a python script to do 2 documents
> >> indexing
> >> >> > each time on Solr. I have 28 GB RAM with 8 CPU.
> >> >> > When I started indexing, at that time 15 GB RAM was freed. While
> >> >> indexing,
> >> >> > all RAM is consumed but **not** a single document is indexed. Why
> so?
> >> >> > And it through *HTTPError: HTTP Error 503: Service Unavailable* in
> >> python
> >> >> > script.
> >> >> > I think it is due to heavy load on Zookeeper by which all nodes
> went
> >> >> down.
> >> >> > I am not sure about that. Any help please..
> >> >> > Or anything else is happening..
> >> >> > And how to overcome this issue.
> >> >> > Please assist me towards right path.
> >> >> > Thanks..
> >> >> >
> >> >> > Warm Regards,
> >> >> > Nitin Solanki
> >> >>
> >>
>


Re: Relevancy : Keyword stuffing

2015-03-16 Thread Mihran Shahinian
Thank you Markus and Chris for the pointers.
For SweetSpotSimilarity I am thinking that perhaps a set of closed ranges
exposed via the similarity config is easier to maintain as data changes than
making adjustments to fit a
function. Another piece of info that would've been handy is the average
position info, plus the position info for the first few occurrences of each term.
This would perhaps allow
higher boosting for term occurrences earlier in the doc. In my case the
extra keywords are towards the end of the doc, but that info does not seem
to be propagated into the scorer.
Thanks again,
Mihran



On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter 
wrote:

>
> You should start by checking out the "SweetSpotSimilarity" .. it was
> heavily designed arround the idea of dealing with things like excessively
> verbose titles, and keyword stuffing in summary text ... so you can
> configure your expectation for what a "normal" length doc is, and they
> will be penalized for being longer then that.  similarly you can say what
> a 'resaonable' tf is, and docs that exceed that would't get added boost
> (which in conjunction with teh lengthNorm penality penalizes docs that
> stuff keywords)
>
>
> https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html
>
>
> https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg
>
> https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Nginx proxy for Solritas

2015-03-16 Thread Shawn Heisey
On 3/16/2015 2:42 PM, LongY wrote:
> Thank you for the reply.
>
> I also thought the relevant resources (CSS, images, JavaScript) need to 
> be accessible for Nginx. 
>
> I copied the velocity folder to solr-webapp/webapp folder. It didn't work.
>
> So how to allow /browse resource accessible by the Nginx rule?

The /browse handler causes your browser to make requests directly to
Solr on handlers other than /browse.  You must figure out what those
requests are and allow them in the proxy configuration.  I do not know
whether they are relative URLs ... I would not be terribly surprised to
learn that they have port 8983 in them rather than the port 80 on your
proxy.  Hopefully that's not the case, or you'll really have problems
making it work on port 80.

I've never spent any real time with the /browse handler.  Requiring
direct access to Solr is completely unacceptable for us.

Thanks,
Shawn



RE: Relevancy : Keyword stuffing

2015-03-16 Thread Markus Jelsma
Hello - Chris' suggestion is indeed a good one but it can be tricky to properly 
configure the parameters. Regarding position information, you can override 
dismax to have it use SpanFirstQuery. It allows for setting strict boundaries 
from the front of the document to a given position. You can also override 
SpanFirstQuery to incorporate a gradient, to decrease boosting as distance from 
the front increases.
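
A minimal Lucene-level sketch of that SpanFirstQuery idea (the field name, term and
position limit are just placeholders):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanFirstSketch {
  public static void main(String[] args) {
    // only count matches of "discount" that occur within the first 50 positions of "body"
    SpanFirstQuery q = new SpanFirstQuery(
        new SpanTermQuery(new Term("body", "discount")), 50);
    System.out.println(q);
  }
}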

I don't know how you ingest document bodies, but if they are unstructured HTML, 
you may want to install proper main content extraction if you haven't already. 
Having decent control over HTML is a powerful tool.

You may also want to look at Lucene's BM25 implementation. It is simple to set 
up and easier to control. It isn't as rough a tool as TFIDF is with regard to 
length normalization. Plus it allows you to smooth TF, which in your case 
should also help.

If you like to scrutinize SSS and get some proper results, you are more than 
welcome to share them here :)

Markus
 
-Original message-
> From:Mihran Shahinian 
> Sent: Monday 16th March 2015 22:41
> To: solr-user@lucene.apache.org
> Subject: Re: Relevancy : Keyword stuffing
> 
> Thank you Markus and Chris, for pointers.
> For SweetSpotSimilarity I am thinking perhaps a set of closed ranges
> exposed via similarity config is easier to maintain as data changes than
> making adjustments to fit a
> function. Another piece of info would've been handy is to know the average
> position info + position info for the first few occurrences for each term.
> This would allow
> perhaps higher boosting for term occurrences earlier in the doc. In my case
> extra keywords are towards the end of the doc,but that info does not seem
> to be propagated into scorer.
> Thanks again,
> Mihran
> 
> 
> 
> On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter 
> wrote:
> 
> >
> > You should start by checking out the "SweetSpotSimilarity" .. it was
> > heavily designed arround the idea of dealing with things like excessively
> > verbose titles, and keyword stuffing in summary text ... so you can
> > configure your expectation for what a "normal" length doc is, and they
> > will be penalized for being longer then that.  similarly you can say what
> > a 'resaonable' tf is, and docs that exceed that would't get added boost
> > (which in conjunction with teh lengthNorm penality penalizes docs that
> > stuff keywords)
> >
> >
> > https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html
> >
> >
> > https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg
> >
> > https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg
> >
> >
> > -Hoss
> > http://www.lucidworks.com/
> >
> 


discrepancy between LuceneQParser and ExtendedDismaxQParser

2015-03-16 Thread Arsen
Hello,

Found discrepancy between LuceneQParser and ExtendedDismaxQParser when 
executing following query:
((*:* AND -area) OR area:[100 TO 300]) AND objectId:40105451

When executing it through the Solr Admin panel and placing the query in the "q" 
field, I have the following debug output for LuceneQParser:
--
"debug": {
"rawquerystring": "((*:* AND -area) OR area:[100 TO 300]) AND 
objectId:40105451",
"querystring": "((*:* AND -area) OR area:[100 TO 300]) AND 
objectId:40105451",
"parsedquery": "+((+MatchAllDocsQuery(*:*) -text:area) area:[100 TO 300]) 
+objectId:40105451",
"parsedquery_toString": "+((+*:* -text:area) area:[100 TO 300]) +objectId: 
\u0001\u\u\u\u\u\u0013\u000fkk",
"explain": {
  "40105451": "\n14.3511 = (MATCH) sum of:\n  0.034590416 = (MATCH) product 
of:\n0.06918083 = (MATCH) sum of:\n  0.06918083 = (MATCH) sum of:\n 
   0.06918083 = (MATCH) MatchAllDocsQuery, product of:\n  0.06918083 = 
queryNorm\n0.5 = coord(1/2)\n  14.316509 = (MATCH) weight(objectId: 
\u0001\u\u\u\u\u\u0013\u000fkk in 1109978) 
[DefaultSimilarity], result of:\n14.316509 = score(doc=1109978,freq=1.0), 
product of:\n  0.9952025 = queryWeight, product of:\n14.385524 = 
idf(docFreq=1, maxDocs=1301035)\n0.06918083 = queryNorm\n  
14.385524 = fieldWeight in 1109978, product of:\n1.0 = tf(freq=1.0), 
with freq of:\n  1.0 = termFreq=1.0\n14.385524 = idf(docFreq=1, 
maxDocs=1301035)\n1.0 = fieldNorm(doc=1109978)\n"
},
--
So, one object is found, which is expected.

For ExtendedDismaxQParser (the only difference is that the "edismax" checkbox is 
checked) I am seeing this output:
--
"debug": {
"rawquerystring": "((*:* AND -area) OR area:[100 TO 300]) AND 
objectId:40105451",
"querystring": "((*:* AND -area) OR area:[100 TO 300]) AND 
objectId:40105451",
"parsedquery": "(+(+((+DisjunctionMaxQuery((text:*\\:*)) 
-DisjunctionMaxQuery((text:area))) area:[100 TO 300]) 
+objectId:40105451))/no_coord",
"parsedquery_toString": "+(+((+(text:*\\:*) -(text:area)) area:[100 TO 
300]) +objectId: \u0001\u\u\u\u\u\u0013\u000fkk)",
"explain": {},
--
oops, no objects found!

I hastened to file https://issues.apache.org/jira/browse/SOLR-7249 (sorry, my 
bad).
You may refer to it for additional info (I am not going to duplicate it here).

Thanks

-- 
Best regards,
 Arsen  mailto:barracuda...@mail.ru



Re: discrepancy between LuceneQParser and ExtendedDismaxQParser

2015-03-16 Thread Jack Krupansky
There was a Solr release with a bug that required that you put a space
between the left parenthesis and the "*:*". The edismax parsed query here
indicates that the "*:*" has not parsed properly.

You have "area", but in your jira you had a range query.

-- Jack Krupansky

On Mon, Mar 16, 2015 at 6:42 PM, Arsen  wrote:

> Hello,
>
> Found discrepancy between LuceneQParser and ExtendedDismaxQParser when
> executing following query:
> ((*:* AND -area) OR area:[100 TO 300]) AND objectId:40105451
>
> When executing it through Solr Admin panel and placing query in "q" field
> I having following debug output for LuceneQParser
> --
> "debug": {
> "rawquerystring": "((*:* AND -area) OR area:[100 TO 300]) AND
> objectId:40105451",
> "querystring": "((*:* AND -area) OR area:[100 TO 300]) AND
> objectId:40105451",
> "parsedquery": "+((+MatchAllDocsQuery(*:*) -text:area) area:[100 TO
> 300]) +objectId:40105451",
> "parsedquery_toString": "+((+*:* -text:area) area:[100 TO 300])
> +objectId: \u0001\u\u\u\u\u\u0013\u000fkk",
> "explain": {
>   "40105451": "\n14.3511 = (MATCH) sum of:\n  0.034590416 = (MATCH)
> product of:\n0.06918083 = (MATCH) sum of:\n  0.06918083 = (MATCH)
> sum of:\n0.06918083 = (MATCH) MatchAllDocsQuery, product of:\n
> 0.06918083 = queryNorm\n0.5 = coord(1/2)\n  14.316509 = (MATCH)
> weight(objectId: \u0001\u\u\u\u\u\u0013\u000fkk in
> 1109978) [DefaultSimilarity], result of:\n14.316509 =
> score(doc=1109978,freq=1.0), product of:\n  0.9952025 = queryWeight,
> product of:\n14.385524 = idf(docFreq=1, maxDocs=1301035)\n
> 0.06918083 = queryNorm\n  14.385524 = fieldWeight in 1109978, product
> of:\n1.0 = tf(freq=1.0), with freq of:\n  1.0 =
> termFreq=1.0\n14.385524 = idf(docFreq=1, maxDocs=1301035)\n
> 1.0 = fieldNorm(doc=1109978)\n"
> },
> --
> So, one object found which is expectable
>
> For ExtendedDismaxQParser (only difference is checkbox "edismax" checked)
> I am seeing this output
> --
> "debug": {
> "rawquerystring": "((*:* AND -area) OR area:[100 TO 300]) AND
> objectId:40105451",
> "querystring": "((*:* AND -area) OR area:[100 TO 300]) AND
> objectId:40105451",
> "parsedquery": "(+(+((+DisjunctionMaxQuery((text:*\\:*))
> -DisjunctionMaxQuery((text:area))) area:[100 TO 300])
> +objectId:40105451))/no_coord",
> "parsedquery_toString": "+(+((+(text:*\\:*) -(text:area)) area:[100 TO
> 300]) +objectId: \u0001\u\u\u\u\u\u0013\u000fkk)",
> "explain": {},
> --
> oops, no objects found!
>
> I hastened to fill https://issues.apache.org/jira/browse/SOLR-7249
> (sorry, my bad)
> You may refer to it for additional info (not going to duplicate it here)
>
> Thanks
>
> --
> Best regards,
>  Arsen  mailto:barracuda...@mail.ru
>
>


Re: Nginx proxy for Solritas

2015-03-16 Thread Erik Hatcher
Have a look at the requests being made to Solr while using /browse (without 
nginx) and that will show you what resources need to be accessible.


—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com 




> On Mar 16, 2015, at 4:42 PM, LongY  wrote:
> 
> Thank you for the reply.
> 
> I also thought the relevant resources (CSS, images, JavaScript) need to 
> be accessible for Nginx. 
> 
> I copied the velocity folder to solr-webapp/webapp folder. It didn't work.
> 
> So how to allow /browse resource accessible by the Nginx rule?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Nginx-proxy-for-Solritas-tp4193347p4193352.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Nginx proxy for Solritas

2015-03-16 Thread LongY
Thanks to Erik and Shawn, I figured out the solution.

* place main.css from the velocity folder into
/usr/share/nginx/html/solr/collection1/admin/file/
* don't forget to change the permission of main.css with sudo chmod 755
main.css
* add main.css to the configuration file of Nginx:
server {
listen 80 default_server;
listen [::]:80 default_server ipv6only=on;
index main.css;
server_name localhost;
location ~* /solr/\w+/browse {
proxy_pass   http://localhost:8983; 
allow   127.0.0.1;
denyall;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header Host $http_host;
 
} 
}
That will work.
Also /var/log/nginx/error.log is good for debugging.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nginx-proxy-for-Solritas-tp4193347p4193415.html
Sent from the Solr - User mailing list archive at Nabble.com.


maxQueryFrequency v/s thresholdTokenFrequency

2015-03-16 Thread Nitin Solanki
Hello Everyone,
 Can anybody please explain what the
difference is between maxQueryFrequency and thresholdTokenFrequency?
I found the link -
http://wiki.apache.org/solr/SpellCheckComponent#thresholdTokenFrequency - but
am unable to understand it.
I am very confused about the two.
Your help is appreciated.


Warm Regards,
Nitin


Admin extra menu becomes invisible

2015-03-16 Thread Dikshant Shahi
Hi,

I uncommented the html tags in admin-extra.menu-top and
admin-extra.menu-bottom. It works fine when I select the core from the
dropdown but once I click on any other tab like Replication, Dataimport
etc, it disappears.

I tried it in Solr 4.6.1 and Solr 5.0.0 and the behavior is the same.

I could see there is a fix in JIRA issue 4405, but I don't see it
working.

I am wondering if I am missing something.

Thanks,
Dikshant


Want to modify Solr Source Code

2015-03-16 Thread Nitin Solanki
Hi,
 I want to modify the Solr source code, but I don't have any idea where
the source code is available. How can I do this?
Any help please...


Re: Unable to find query result in solr 5.0.0

2015-03-16 Thread rupak
Hi Jack Krupansky,

We are following the apache-solr-ref-guide-5.0 doc for installation. But
when I execute the post command via "$ bin/post -c gettingstarted
example/exampledocs/*.json" in the command prompt, I get *"'post' is not
recognized as an internal or external command, operable program or batch
file."* Now we are unable to proceed because we cannot post any documents
to Solr.

Can you please help us to overcome this? We understand that we need to post
documents in order to get any query results, so please let us know the
process to do so.


--
Thanks & Regards,
Rupak.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unable-to-find-query-result-in-solr-5-0-0-tp4189196p4193434.html
Sent from the Solr - User mailing list archive at Nabble.com.