Re: FieldCache insanity with field used as facet and group

2013-05-28 Thread Elodie Sannier

I've created https://issues.apache.org/jira/browse/SOLR-4866

Elodie

Le 07.05.2013 18:19, Chris Hostetter a écrit :

: I am using the Lucene FieldCache with SolrCloud and I have "insane" instances
: with messages like:

FWIW: I'm the one that named the result of these "sanity checks"
"FieldCacheInsanity" and I have regretted it ever since -- a better label
would have been "inconsistency".

: VALUEMISMATCH: Multiple distinct value objects for
: SegmentCoreReader(owner=_11i(4.2.1):C4493997/853637)+merchantid
: 'SegmentCoreReader(owner=_11i(4.2.1):C4493997/853637)'=>'merchantid',class org.apache.lucene.index.SortedDocValues,0.5=>org.apache.lucene.search.FieldCacheImpl$SortedDocValuesImpl#557711353
: 'SegmentCoreReader(owner=_11i(4.2.1):C4493997/853637)'=>'merchantid',int,null=>org.apache.lucene.search.FieldCacheImpl$IntsFromArray#1105988713
: 'SegmentCoreReader(owner=_11i(4.2.1):C4493997/853637)'=>'merchantid',int,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_INT_PARSER=>org.apache.lucene.search.FieldCacheImpl$IntsFromArray#1105988713
:
: All insane instances are for a field "merchantid" of type "int" used as facet
: and group field.

Interesting: it appears that the grouping code and the facet code are not
being consistent in how they are building the field cache, so you are
getting two objects in the cache for each segment.

I haven't checked if this happens much with the example configs, but if
you could: please file a bug with the details of which Solr version you
are using, along with the schema fieldType & field declarations for your
merchantid field, and the mbean stats output showing the field
cache insanity after executing two queries like...

/select?q=*:*&facet=true&facet.field=merchantid
/select?q=*:*&group=true&group.field=merchantid

(that way we can rule out your custom SearchComponent as having a bug in
it)
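
For the cache stats, a request along these lines should work on 4.x (the path
and core name are whatever your setup uses; the stats handler itself is stock):

/solr/collection1/admin/mbeans?stats=true

and the fieldCache entry should report entries_count / insanity_count along
with the offending cache entries.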

: Can this insanity have a performance impact?
: How can I fix it?

The impact is just that more RAM is being used than is probably strictly
necessary. Unless there is something unusual in your fieldType
declaration, I don't think there is an easy fix you can apply -- we need to
fix the underlying code.

-Hoss



--
Kelkoo

Elodie Sannier, Software engineer



Strange behavior on text field with number-text content

2013-05-28 Thread Michał Matulka

Hello,

I've got the following problem. I have a "text" type in my schema and a field
"name" of that type.
That field contains data; there is, for example, a record that has
"300letters" as its name.


Now field type definition:


And, of course, field definition:


yes, that's all - there are no tokenizers.

And now time for my question:

Why do the following queries:

name:300

and

name:letters

return that record, but:

name:300letters

does not (0 results)?

Best regards,
Michał Matulka


Re: Indexing message module

2013-05-28 Thread Upayavira
Switching from single to multivalued shouldn't cause your index to break
(but your app might not like it).

Do you have a deduplication issue, or does each message have a unique
ID? You might be able to use the DedupUpdateProcessorFactory to prevent
updates to an existing message getting into the index.
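
For what it's worth, the dedup processor that ships with Solr is
solr.processor.SignatureUpdateProcessorFactory; a rough sketch of that plus a
multi-valued recipients field (field names, signature fields and types here
are assumptions, not taken from your schema):

<!-- schema.xml: one document per message, many recipient ids -->
<field name="recipient_id" type="int" indexed="true" stored="true" multiValued="true"/>

<!-- solrconfig.xml: overwrite duplicates detected by a content signature -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">subject,body</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Searching for one recipient's messages is then just a filter, e.g.
fq=recipient_id:6546.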

Upayavira 

On Tue, May 28, 2013, at 07:44 AM, Arkadi Colson wrote:
> Is it ok to just change the multivalue attribute to true and reindex the 
> message module data? There are also other modules indexed on the same 
> schema with multivalued = false. Will it become a problem?
> 
> BR,
> Arkadi
> 
> On 05/27/2013 09:33 AM, Gora Mohanty wrote:
> > On 27 May 2013 12:58, Arkadi Colson  wrote:
> >> Hi
> >>
> >> We would like to index our messages system. We should be able to search for
> >> messages for specific recipients due to performance issues on our 
> >> databases.
> >> But the message is of course the same for all receipients and the message
> >> text should be saved only once! Is it possible to have some kind of array
> >> field to include in the search query where all the recipients are stored? 
> >> Or
> >> should we for example use a simple text field which is filled with the
> >> receipients like this: _434_3432_432_6546_75_8678_
> > [...]
> >
> > Why couldn't you use a multi-valued string/int field for the
> > recipient IDs?
> >
> > Regards,
> > Gora
> >
> >
> 


Re: Solr faceted search UI

2013-05-28 Thread Fergus McDowall
Hi Richa

Solrstrap is probably the best way to go if you just want to get up a PoC
as fast as possible. Solrstrap requires no installation of middleware, you
just add in the address of your solr server and open the file in your
browser.

Regards
Fergus



On Wed, Apr 24, 2013 at 5:23 PM, richa  wrote:

> Thank you very much for your suggestion.
> This is only for PoC. As you suggested about blacklight, can I run this on
> windows and to build PoC do I have to have ruby on rails knowledge?
>
> Irrespective of any technology and considering the fact that in past I had
> worked on java, j2ee what would you suggest or how would you have proceeded
> for this?
>
> Blacklight seems to be a good option, not sure without prior knowledge of
> ruby on rails, will I be able to present in short period of time? any
> suggestion on this?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-faceted-search-UI-tp4058598p4058617.html
>  Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr faceted search UI

2013-05-28 Thread Fergus McDowall
You also get some smooth UI stuff "for free"

F

On Tue, May 28, 2013 at 10:58 AM, Fergus McDowall
wrote:



Re: Solr 4.3: node is seen as active in Zk while in recovery mode + endless recovery

2013-05-28 Thread AlexeyK
The cluster state problem reported above is not an issue - it was caused by
our own code.
Speaking about the update log - I have noticed a strange behavior concerning
the replay. The replay is *supposed* to be done for a predefined number of
log entries, but actually it is always done for the whole last 2 tlogs.
RecentUpdates.update() reads the log within while (numUpdates <
numRecordsToKeep), but numUpdates is never incremented, so the loop only
exits when the reader reaches EOF.





Paging with all Hits

2013-05-28 Thread Andreas Niekler
Hello,

I indexed some monographs with Solr. Within each document I have a
multi-valued field where I store the paragraphs. When I search for a
specific term within the monographs I get the whole monograph as a
result object. The single hits can be accessed via the highlight
component. This prevents server-side paging over all the hits within one
monograph.

Is there either a possibility to page results within a multi-valued
field instead of over whole documents? Can I show each single value of a
multi-valued field as a result?

Or can I page the highlighted results (all of them) without showing the
documents?

Thank you very much

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniek...@informatik.uni-leipzig.deg.de


What exactly happens to extant documents when the schema changes?

2013-05-28 Thread Dotan Cohen
When adding or removing a text field to/from the schema and then
restarting Solr, what exactly happens to extant documents? Is the
schema only consulted when Solr writes a document, therefore extant
documents are unaffected?

Considering that Solr supports dynamic fields, my experimentation with
removing and adding fields to the schema has shown almost no change in
the extant index results returned.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: What exactly happens to extant documents when the schema changes?

2013-05-28 Thread Upayavira


On Tue, May 28, 2013, at 10:21 AM, Dotan Cohen wrote:
> When adding or removing a text field to/from the schema and then
> restarting Solr, what exactly happens to extant documents? Is the
> schema only consulted when Solr writes a document, therefore extant
> documents are unaffected?
> 
> Considering that Solr supports dynamic fields, my experimentation with
> removing and adding fields to the schema has shown almost no change in
> the extant index results returned.

The schema provides Solr with a description of what it will find in the
Lucene indexes. If you, for example, changed a string field to an
integer in your schema, that'd mess things up bigtime. I recently had to
upgrade a date field from the 1.4.1 date field format to the newer
TrieDateField. Given I had to do it on a live index, I had to add a new
field (just using copyfield) and re-index over the top, as the old field
was still in use. I guess, given my app now uses the new date field
only, I could presumably reindex the old date field with the new
TrieDateField format, but I'd want to try that before I do it for real.
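
For illustration, that kind of parallel-field migration looks roughly like this
in schema.xml (field and type names invented for the example):

<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>

<!-- old field stays as-is while the app still queries it -->
<field name="created" type="date" indexed="true" stored="true"/>
<!-- new field gets populated automatically as documents are re-indexed -->
<field name="created_tdt" type="tdate" indexed="true" stored="true"/>
<copyField source="created" dest="created_tdt"/>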

However, if you changed a single valued field to a multi-valued one,
that's not an issue, as a field with a single value is still valid for a
multi-valued field.

Also, if you add a new field, existing documents will be considered to
have no value in that field. If that is acceptable, then you're fine.

I guess if you remove a field, then those fields will be ignored by
Solr, and thus not impact anything. But I have to say, I've never tried
that.

Thus - changing the schema will only impact on future indexing. Whether
your existing index will still be valid depends upon the changes you are
making.

Upayavira


Re: Strange behavior on text field with number-text content

2013-05-28 Thread Alexandre Rafalovitch
 What does analyzer screen say in the Web AdminUI when you try to do that?
Also, what are the tokens stored in the field (also in Web AdminUI).

I think it is very strange to have TextField without a tokenizer chain.
Maybe you get a standard one assigned by default, but I don't know what the
standard chain would be.

Regards,

  Alex.
On 28 May 2013 04:44, "Michał Matulka"  wrote:

> Hello,
>
> I've got following problem. I have a text type in my schema and a field
> "name" of that type.
> That field contains a data, there is, for example, record that has
> "300letters" as name.
>
> Now field type definition:
> 
>
> And, of course, field definition:
> 
>
> yes, that's all - there are no tokenizers.
>
> And now time for my question:
>
> Why following queries:
>
> name:300
>
> and
>
> name:letters
>
> are returning that result, but:
>
> name:300letters
>
> is not (0 results)?
>
> Best regards,
> Michał Matulka
>


Re: Core admin action "CREATE" fails to persist some settings in solr.xml with Solr 4.3

2013-05-28 Thread Erick Erickson
Hmmm, that's the second time somebody's had that problem. It's
assigned to me now anyway, thanks for creating it!

Erick

On Mon, May 27, 2013 at 10:11 AM, André Widhani
 wrote:
> I created SOLR-4862 ... I found no way to assign the ticket to somebody 
> though (I guess it is under "Workflow", but the button is greyed out).
>
> Thanks,
> André
>


Associate item with more than one location

2013-05-28 Thread Spadez
I currently have an item which gets imported into Solr; let's call it a book
entry. It has a single location associated with it, as a coordinate
and location name, but I am now finding out that a single entry may actually
need to be associated with more than one location, for example "New York"
and "London".

Is it possible for solr to have items associated with more than one
location, so that when searching geo spatially on coordinates it can be
found at both locations?

Thank you





Re: Strange behavior on text field with number-text content

2013-05-28 Thread Erick Erickson
Hmmm, with 4.x I get much different behavior than you're
describing, what version of Solr are you using?

Besides Alex's comments, try adding &debug=query to the url and see what comes
out from the query parser.
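
For example, with the field from this thread, something like:

/select?q=name:300letters&debug=query

and look at the parsedquery entry in the debug section of the response.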

A quick glance at the code shows that DefaultAnalyzer is used, which doesn't do
any analysis, here's the javadoc...
 /**
   * Default analyzer for types that only produces 1 verbatim token...
   * A maximum size of chars to be read must be specified
   */

so it's much like the "string" type. Which means I'm totally perplexed by your
statement that 300 and letters return a hit. Have you perhaps changed the
field definition and not re-indexed?

The behavior you're seeing really looks like somehow WordDelimiterFilterFactory
is getting into your analysis chain with settings that don't mash the parts back
together, i.e. you can set up WDDF to split on letter/number transitions, index
each and NOT index the original, but I have no explanation for how that
could happen with the field definition you indicated

FWIW,
Erick

On Tue, May 28, 2013 at 7:47 AM, Alexandre Rafalovitch
 wrote:
>  What does analyzer screen say in the Web AdminUI when you try to do that?
> Also, what are the tokens stored in the field (also in Web AdminUI).
>
> I think it is very strange to have TextField without a tokenizer chain.
> Maybe you get a standard one assigned by default, but I don't know what the
> standard chain would be.
>
> Regards,
>
>   Alex.
> On 28 May 2013 04:44, "Michał Matulka"  wrote:
>
>> Hello,
>>
>> I've got following problem. I have a text type in my schema and a field
>> "name" of that type.
>> That field contains a data, there is, for example, record that has
>> "300letters" as name.
>>
>> Now field type definition:
>> 
>>
>> And, of course, field definition:
>> 
>>
>> yes, that's all - there are no tokenizers.
>>
>> And now time for my question:
>>
>> Why following queries:
>>
>> name:300
>>
>> and
>>
>> name:letters
>>
>> are returning that result, but:
>>
>> name:300letters
>>
>> is not (0 results)?
>>
>> Best regards,
>> Michał Matulka
>>


Re: Solr 4.3: node is seen as active in Zk while in recovery mode + endless recovery

2013-05-28 Thread Shalin Shekhar Mangar
This sounds like a bug. I'll open an issue. Thanks!


On Tue, May 28, 2013 at 2:29 PM, AlexeyK  wrote:

> The cluster state problem reported above is not an issue - it was caused by
> our own code.
> Speaking about the update log - i have noticed a strange behavior
> concerning
> the replay. The replay is *supposed* to be done for a predefined number of
> log entries, but actually it is always done for the whole last 2 tlogs.
> RecentUpdates.update() reads log within  while (numUpdates <
> numRecordsToKeep), while numUpdates is never incremented, so it exits when
> the reader reaches EOF.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-4-3-node-is-seen-as-active-in-Zk-while-in-recovery-mode-endless-recovery-tp4065549p4066452.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: Strange behavior on text field with number-text content

2013-05-28 Thread Michał Matulka

  
  
Thanks for your responses, I must admit that after hours of trying I made some
mistakes. So the most problematic phrase will now be "4nSolution Inc.", which
cannot be found using the query:

name:4nSolution

or even

name:4nSolution Inc.

but can be found using the following queries:

name:nSolution
name:4
name:inc

Sorry for the mess; it turned out I didn't reindex the fields after modifying
the schema, so I thought that the problem also applied to "300letters".

The cause of all of this is the WordDelimiter filter, defined as follows:
  
  
    
      
      
      
      
[fieldType definition mangled by the mail archive; among the surviving
fragments are two stop-filter entries configured with ignoreCase="true",
words="stopwords.txt" and enablePositionIncrements="true", one in the
index chain and one in the query chain]
      
      
      
    
      
  
and I still don't know why it behaves like that - after all, the
"preserveOriginal" attribute is set to 1...
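
Reassembled from the attributes that survive in the copy quoted further down
in this digest (the filter class name is inferred rather than preserved), the
two WordDelimiterFilterFactory entries read roughly:

<!-- index-time -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="1" preserveOriginal="1"/>

<!-- query-time -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="0" catenateNumbers="0" catenateAll="1"
        splitOnCaseChange="0" preserveOriginal="1"/>

The two sides split and recombine "4nSolution" differently, which is the usual
source of this kind of asymmetric matching; the Analysis screen in the admin
UI shows exactly which terms each side produces for that input.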
  
  
  




-- 
Regards,
Michał Matulka
Software developer
michal.matu...@gowork.pl

GoWork.pl
ul. Zielna 39
00-108 Warszawa
www.GoWork.pl



Re: What exactly happens to extant documents when the schema changes?

2013-05-28 Thread Jack Krupansky

The technical answer: Undefined and not guaranteed.

Sure, you can experiment and see what the effects "happen" to be in any 
given release, and maybe they don't tend to change (too much) between most 
releases, but there is no guarantee that any given "change schema but keep 
existing data without a delete of directory contents and full reindex" will 
actually be benign or what you expect.


As a general proposition, when it comes to changing the schema and not 
deleting the directory and doing a full reindex, don't do it! Of course, we 
all know not to try to walk on thin ice, but a lot of people will try to do 
it anyway - and maybe it happens that most of the time the results are 
benign.


OTOH, you could file a Jira to propose that the effects of changing the 
schema but keeping the existing data should be precisely defined and 
documented, but, that could still change from release to release.


From a practical perspective for your original question: If you suddenly add 
a field, there is no guarantee what will happen when you try to access that 
field for existing documents, or what will happen if you "update" existing 
documents. Sure, people can talk about what "happens to be true today", but 
there is no guarantee for the future. Similarly for deleting a field from 
the schema, there is no guarantee about the status of existing data, even 
though people can chatter about "what it seems to do today."


Generally, you should design your application around contracts and what is 
guaranteed to be true, not what happens to be true from experiments or even 
experience. Granted, that is the theory and sometimes you do need to rely on 
experimentation and folklore and spotty or ambiguous documentation, but to 
the extent possible, it is best to avoid explicitly trying to rely on 
undocumented, uncontracted behavior.


One question I asked long ago and never received an answer: what is the best 
practice for doing a full reindex - is it sufficient to first do a delete of 
"*:*", or does the Solr index directory contents or even the directory 
itself need to be explicitly deleted first? I believe it is the latter, but 
the former "seems" to work, most of the time. Deleting the directory itself 
"seems" to be the best answer, to date - but no guarantees!



-- Jack Krupansky

-Original Message- 
From: Dotan Cohen

Sent: Tuesday, May 28, 2013 5:21 AM
To: solr-user@lucene.apache.org
Subject: What exactly happens to extant documents when the schema changes?

When adding or removing a text field to/from the schema and then
restarting Solr, what exactly happens to extant documents? Is the
schema only consulted when Solr writes a document, therefore extant
documents are unaffected?

Considering that Solr supports dynamic fields, my experimentation with
removing and adding fields to the schema has shown almost no change in
the extant index results returned.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com 



Re: Paging with all Hits

2013-05-28 Thread Jack Krupansky
Dynamic and multi-valued fields are both powerful but dangerous features.
Yes, they offer wonderful capabilities - if used in moderation - but
expecting that they are "get out of jail free / go past go as many times as
you want" cards to ignore the limits of Solr and do anything you want is a
really bad idea.


Yes, the fact that multi-valued fields are not first-class Lucene/Solr 
objects is a problem, but the limitations were all known in advance and no 
guarantees were made, so you don't have much of an excuse now, other than to 
lament the fact that somebody conned you into believing that multi-valued 
fields were some kind of magic elixir, a magic "escape hatch" to a world 
where the limits of Lucene and Solr don't apply. Sigh.


Multi-valued field are great for storing a "few" related items, even 
"dozens" of them, typically modest-length strings. But storing "hundreds" or 
"thousands" or storing large bulky items (e.g., entire contents of the text 
of a page) are a really bad idea. Sure, maybe it does work, at least for 
some cases, for some people, some of the time, but that shouldn't be the 
criteria for building a robust production application.


A couple of the serious limitations of multi-valued fields are that 
individual elements cannot be "addressed", either to insert or delete or 
move, or to receive an indication of which matched. Sorry, but Lucene and 
Solr do not have "sub-documents", which is the "get out of jail free" card 
that a lot of people expect with multi-valued (and dynamic) fields.


If you want an object to be a first-class object, make it a separate Solr 
document. Bite the bullet, and live with it.
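
Applied to the original question, that means indexing one document per
paragraph rather than one per monograph; a sketch with invented field names:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="monograph_id" type="string" indexed="true" stored="true"/>
<field name="paragraph_no" type="int" indexed="true" stored="true"/>
<field name="paragraph_text" type="text_general" indexed="true" stored="true"/>

Paging through the hits inside one monograph then becomes an ordinary query:

/select?q=paragraph_text:someterm&fq=monograph_id:M123&sort=paragraph_no+asc&start=0&rows=20

and grouping on monograph_id (group=true&group.field=monograph_id) still gives
a per-monograph view when you need it.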


-- Jack Krupansky

-Original Message- 
From: Andreas Niekler

Sent: Tuesday, May 28, 2013 5:10 AM
To: solr-user@lucene.apache.org
Subject: Paging with all Hits

Hello,

i indexed some monographs with solr. Within each document a have a
multi-valued field where i store the paragraphs. When i search for a
specific term within the monographs i get the whole monograph as a
result object. The single hits can be accessed via the highlight
component. The prevents server side pageing with all the hits within one
monograph.

Is there either a possibility to page results within a multi-valued
field instead of the whole documents? Can i show each single value of a
mutli valued field as result?

Or can i page the highlighted results (All of them) without showing the
documents?

Thank you very much

--
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniek...@informatik.uni-leipzig.deg.de 



Wiki pages for Solr releases

2013-05-28 Thread Jan Høydahl
Hi,

I have added the missing WIKI pages for

https://wiki.apache.org/solr/Solr4.1
https://wiki.apache.org/solr/Solr4.2
https://wiki.apache.org/solr/Solr4.3

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com



Re: Strange behavior on text field with number-text content

2013-05-28 Thread Алексей Цой
solr-user-unsubscribe 


2013/5/28 Michał Matulka 

>  Thanks for your responses, I must admit that after hours of trying I
> made some mistakes.
> So the most problematic phrase will now be:
> "4nSolution Inc." which cannot be found using query:
>
> name:4nSolution
>
> or even
>
> name:4nSolution Inc.
>
> but can be using following queries:
>
> name:nSolution
> name:4
> name:inc
>
> Sorry for the mess, it turned out I didn't reindex fields after modyfying
> schema so I thought that the problem also applies to 300letters .
>
> The cause of all of this is the WordDelimiter filter defined as following:
>
> 
>   
> 
> 
> 
>  ignoreCase="true"
> words="stopwords.txt"
> enablePositionIncrements="true"
> />
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
> preserveOriginal="1"/>
> 
>  language="English" protected="protwords.txt"/>
>   
>   
> 
>  ignoreCase="true" expand="true"/>
>  ignoreCase="true"
> words="stopwords.txt"
> enablePositionIncrements="true"
> />
>  generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="1" splitOnCaseChange="0"
> preserveOriginal="1" />
> 
>  language="English" protected="protwords.txt"/>
>   
> 
>
> and I still don't know why it behaves like that - after all there is
> "preserveOriginal" attribute set to 1...
>


Disable all caches in solr

2013-05-28 Thread yriveiro
Hi, 

How can I disable all the caches that Solr uses?

Regards

/Yago





Re: Paging with all Hits

2013-05-28 Thread Alexandre Rafalovitch

I feel that the strength of the Jack's rant is somewhat unprovoked by
the original question. I also feel that the rant itself is worth being
printed and framed :-)

But more than anything else, I feel that supposedly-known limitations
of Solr/Lucene are not actually exposed all that much. Certainly, for
myself, I did not see those iron-clad BEWARE OF THE DRAGONS signs
anywhere on the Wiki or otherwise. I feel that they are more like Zen
aspects that one learns by reading between the lines of various forum
messages and by thinking through the presentations such as Adrian
Trenaman's (on Gilt's experience).

Maybe the books are supposed to do that, but even they, I feel, are
failing to do it perfectly (including my own, I am sure).

Just a thought.


Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)

On Tue, May 28, 2013 at 9:19 AM, Jack Krupansky  wrote:
> Yes, the fact that multi-valued fields are not first-class Lucene/Solr
> objects is a problem, but the limitations were all known in advance and no
> guarantees were made, so you don't have much of an excuse now, other than to
> lament the fact that somebody conned you into believing that multi-valued
> fields were some kind of magic elixir, a magic "escape hatch" to a world
> where the limits of Lucene and Solr don't apply. Sigh.


Re: Disable all caches in solr

2013-05-28 Thread Shalin Shekhar Mangar
Edit the solrconfig.xml and remove/comment <filterCache>, <queryResultCache>,
<documentCache>. Note that some caches such as FieldCache (created for
sorting/faceting on demand) cannot be disabled.
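
In the stock solrconfig.xml that amounts to commenting out the three
declarations inside the <query> section, roughly (sizes shown are the example
defaults):

<!--
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
-->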


On Tue, May 28, 2013 at 8:10 PM, yriveiro  wrote:

> Hi,
>
> How I can disable all caches that solr use?
>
> Regards
>
> /Yago
>
>
>
> -
> Best regards
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Disable-all-caches-in-solr-tp4066517.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: Disable all caches in solr

2013-05-28 Thread Yago Riveiro
Indeed, I commented out all the cache entries in solrconfig, but solrmeter still
shows the fieldCache type. Now I know why.

Thanks Shalin,

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, May 28, 2013 at 3:53 PM, Shalin Shekhar Mangar wrote:

> Edit the solrconfig.xml and remove/comment , ,
> . Note that some caches such as FieldCache (created for
> sorting/faceting on demand) cannot be disabled.
> 
> 
> On Tue, May 28, 2013 at 8:10 PM, yriveiro  (mailto:yago.rive...@gmail.com)> wrote:
> 
> > Hi,
> > 
> > How I can disable all caches that solr use?
> > 
> > Regards
> > 
> > /Yago
> > 
> > 
> > 
> > -
> > Best regards
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Disable-all-caches-in-solr-tp4066517.html
> > Sent from the Solr - User mailing list archive at Nabble.com 
> > (http://Nabble.com).
> > 
> 
> 
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.
> 
> 




Re: Paging with all Hits

2013-05-28 Thread Jack Krupansky

:)

-- Jack Krupansky
-Original Message- 
From: Alexandre Rafalovitch

Sent: Tuesday, May 28, 2013 10:41 AM
To: solr-user@lucene.apache.org
Subject: Re: Paging with all Hits


I feel that the strength of the Jack's rant is somewhat unprovoked by
the original question. I also feel that the rant itself is worth being
printed and framed :-)

But more than anything else, I feel that supposedly-known limitations
of Solr/Lucene are not actually exposed all that much. Certainly, for
myself, I did not see those iron-clad BEWARE OF THE DRAGONS signs
anywhere on the Wiki or otherwise. I feel that they are more like Zen
aspects that one learns by reading between the lines of various forum
messages and by thinking through the presentations such as Adrian
Trenaman's (on Gilt's experience).

Maybe the books are supposed to do that, but even they, I feel, are
failing to do it perfectly (including my own, I am sure).

Just a thought.


Regards,
  Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)

On Tue, May 28, 2013 at 9:19 AM, Jack Krupansky  
wrote:

Yes, the fact that multi-valued fields are not first-class Lucene/Solr
objects is a problem, but the limitations were all known in advance and no
guarantees were made, so you don't have much of an excuse now, other than 
to

lament the fact that somebody conned you into believing that multi-valued
fields were some kind of magic elixir, a magic "escape hatch" to a world
where the limits of Lucene and Solr don't apply. Sigh. 




Solr Composite Unique key from existing fields in schema

2013-05-28 Thread Rishi Easwaran
Hi All,

Historically we have used a single field in our schema as a uniqueKey.

  
   
docid

We wanted to change this to a composite key, something like userid-docid.
I know I can auto-generate the composite key at document insert time, using custom
code to generate a new field, but I wanted to know if there is a built-in Solr
mechanism for doing this. That would prevent us from creating and storing an
extra field.

Thanks,

Rishi.






Re: Solr Composite Unique key from existing fields in schema

2013-05-28 Thread Jan Høydahl
The cleanest is to do this from the outside.

Alternatively, it will perhaps work to populate your uniqueKey in a custom 
UpdateProcessor. You can try.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

28. mai 2013 kl. 17:12 skrev Rishi Easwaran :

> Hi All,
> 
> Historically we have used a single field in our schema as a uniqueKey.
> 
>   multiValued="false" required="true"/>
>   multiValued="false" required="true"/> 
> docid
> 
> Wanted to change this to a composite key something like 
> userid-docid.
> I know I can auto generate compositekey at document insert time, using custom 
> code to generate a new field, but wanted to know if there was an inbuilt SOLR 
> mechanism of doing this. That would prevent us from creating and storing an 
> extra field.
> 
> Thanks,
> 
> Rishi.
> 
> 
> 
> 



Re: Not able to search Spanish word with ascent in solr

2013-05-28 Thread jignesh
Hello Jack

Thanks for your reply..

I have tried to add the contents below to Solr, as you suggested.

-

  
doc-1
Hola Mañana en le Café, habla el Académie
française!
  

--

BUT I am getting the error below
--
I:\Program
Files\EasyPHP-5.3.9\www\solr\apache-solr-3.6.2\example\exampledocs>java -jar
post.jar hd.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file hd.xml
SimplePostTool: FATAL: Solr returned an error #400 Invalid UTF-8 middle byte
0x6e (at char #916, byte #-1)
--

I have also added 
--

---

And then submit doc to solr

Please guide





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Not-able-to-search-Spanish-word-with-ascent-in-solr-tp4064404p4066537.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: delta-import tweaking?

2013-05-28 Thread Shawn Heisey

On 5/28/2013 12:31 AM, Kristian Rink wrote:

(a) The usual tutorials outline something like

WHERE LASTMODIFIED > '${dih.last_index_time}


[snip]


(b) I see that "last_index_time" returns a particularly fixed format.
In our database, with a modestly more complex SELECT, we also could
figure out which entities have been changed using some protocol table
which includes timestamps in seconds since EPOCH. Is there some way of
retrieving such a timestamp from DataImportHandler or will I have to do
so somehow on my own?


Your situation sounds like mine.  I found a workaround, but filed 
SOLR-1920 anyway to try and get support for tracking something besides 
the current time.  After nearly three years with no motion, and not 
being able to do it myself, I finally closed it:


https://issues.apache.org/jira/browse/SOLR-1920

My workaround was to store the highest indexed autoincrement value in a 
location outside Solr.  In my original Perl code, I dropped it into a 
file on NFS.  The latest iteration of my indexing code (Java, using 
SolrJ) no longer uses DIH for regular indexing, but it still uses that 
stored autoincrement value, this time in another database table.  I do 
still use full-import for complete index rebuilds.


You can pass arbitrary parameters into Solr via the dataimport URL.  If 
you pass in a variable called maxId, then you can access that in your 
DIH config with ${dih.request.maxId} and use it any way you like.
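
A minimal sketch of that (handler path, entity and column names are assumptions):

http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=false&maxId=1234567

<entity name="item"
        query="SELECT * FROM item WHERE id &gt; ${dih.request.maxId}">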


Thanks,
Shawn



Re: Distributed query: strange behavior.

2013-05-28 Thread Valery Giner

Eric,

Thank you for the explanation.

My problem was that allowing docs with the same unique ids to be
present in multiple shards in a "normal" situation
makes it impossible to estimate the number of shards needed for an index
with a "really large" number of docs.


Thanks,
Val

On 05/26/2013 11:16 AM, Erick Erickson wrote:

Valery:

I share your puzzlement. _If_ you are letting Solr do the document
routing, and not doing any of the custom routing, then the same unique
key should be going to the same shard and replacing the previous doc
with that key.

But, if you're using custom routing, if you've been experimenting with
different configurations and didn't start over, in general if you're
configuration is in an "interesting" state this could happen.

So in the normal case if you have a document with the same key indexed
in multiple shards, that would indicate a bug. But there are many
ways, especially when experimenting, that you could have this happen
which are _not_ a bug. I'm guessing that Luis may be trying the custom
routing option maybe?

Best
Erick

On Fri, May 24, 2013 at 9:09 AM, Valery Giner  wrote:

Shawn,

How is it possible for more than one document with the same unique key to
appear in the index, even in different shards?
Isn't it a bug by definition?
What am I missing here?

Thanks,
Val


On 05/23/2013 09:55 AM, Shawn Heisey wrote:

On 5/23/2013 1:51 AM, Luis Cappa Banda wrote:

I've query each Solr shard server one by one and the total number of
documents is correct. However, when I change rows parameter from 10 to
100
the total numFound of documents change:

I've seen this problem on the list before and the cause has been
determined each time to be caused by documents with the same uniqueKey
value appearing in more than one shard.

What I think happens here:

With rows=10, you get the top ten docs from each of the three shards,
and each shard sends its numFound for that query to the core that's
coordinating the search.  The coordinator adds up numFound, looks
through those thirty docs, and arranges them according to the requested
sort order, returning only the top 10.  In this case, there happen to be
no duplicates.

With rows=100, you get a total of 300 docs.  This time, duplicates are
found and removed by the coordinator.  I think that the coordinator
adjusts the total numFound by the number of duplicate documents it
removed, in an attempt to be more accurate.

I don't know if adjusting numFound when duplicates are found in a
sharded query is the right thing to do, I'll leave that for smarter
people.  Perhaps Solr should return a message with the results saying
that duplicates were found, and if a config option is not enabled, the
server should throw an exception and return a 4xx HTTP error code.  One
idea for a config parameter name would be allowShardDuplicates, but
something better can probably be found.

Thanks,
Shawn





Re: Not able to search Spanish word with ascent in solr

2013-05-28 Thread jignesh
Hello Steve

Thanks for your reply

I don't want to upgrade  solr 4
so your suggestion will be as below

---
you should instead convert these HTML character entities yourself to the
characters they represent (e.g. "&eacute;" -> "é") before sending the
docs to Solr. 
---

Please let me know how can I use your above suggestion.

Please note: when I try to add data to Solr with accented chars, it shows me an
error

--
FATAL: Solr returned an error #400 Invalid UTF-8 middle byte 0x6e (at char
#916, byte #-1)
---

as mention in earlier post...

Please guide



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Not-able-to-search-Spanish-word-with-ascent-in-solr-tp4064404p4066544.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Associate item with more than one location

2013-05-28 Thread Smiley, David W.
Absolutely.  Use "location_rpt" in the example schema.  Do *not* use
LatLonType, which doesn't support multiValued data.
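
Roughly, with the location_rpt type from the stock 4.x example schema (the
field name here is invented):

<field name="locations" type="location_rpt" indexed="true" stored="true" multiValued="true"/>

Index both points into that one field (e.g. "40.71,-74.00" and "51.50,-0.12"),
and a geofilt filter matches the document from either location:

/select?q=*:*&fq={!geofilt sfield=locations pt=51.50,-0.12 d=10}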

~ David Smiley

On 5/28/13 8:02 AM, "Spadez"  wrote:

> currently have an item which gets imported into solr, lets call it a book
>entry. Well that has a single location associated with it as a coordinate
>and location name but I am now finding out that a single entry may
>actually
>need to be associated with more than one location, for example "New York"
>and "London".
>
>Is it possible for solr to have items associated with more than one
>location, so that when searching geo spatially on coordinates it can be
>found at both locations?
>
>Thank you
>
>
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/Associate-item-with-more-than-one-locat
>ion-tp4066477.html
>Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Composite Unique key from existing fields in schema

2013-05-28 Thread Jack Krupansky

You can do this by combining the builtin update processors.

Add this to your solrconfig:


 
   docid_s
   userid_s
   id
 
 
   id
   --
 
 
 


Add documents such as:

curl 
"http://localhost:8983/solr/update?commit=true&update.chain=composite-id" \

-H 'Content-type:application/json' -d '
[{"title": "Hello World",
 "docid_s": "doc-1",
 "userid_s": "user-1",
 "comments_ss": ["Easy", "Fast"]}]'

And get results like:

"title":["Hello World"],
"docid_s":"doc-1",
"userid_s":"user-1",
"comments_ss":["Easy",
 "Fast"],
"id":"doc-1--user-1",

Add as many fields in whatever order you want using "source" in the clone 
update processor, and pick your composite key field name as well. And set 
the delimiter string as well in the concat update processor.


I managed to reverse the field order from what you requested (userid, 
docid).


I used the standard Solr example schema, so I used dynamic fields for the 
two ids, but use your own field names.


-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran

Sent: Tuesday, May 28, 2013 11:12 AM
To: solr-user@lucene.apache.org
Subject: Solr Composite Unique key from existing fields in schema

Hi All,

Historically we have used a single field in our schema as a uniqueKey.

 multiValued="false" required="true"/>
 multiValued="false" required="true"/>

docid

Wanted to change this to a composite key something like 
userid-docid.
I know I can auto generate compositekey at document insert time, using 
custom code to generate a new field, but wanted to know if there was an 
inbuilt SOLR mechanism of doing this. That would prevent us from creating 
and storing an extra field.


Thanks,

Rishi.






Re: Not able to search Spanish word with ascent in solr

2013-05-28 Thread Jack Krupansky
I copied those accented words directly from web pages in Google Chrome on a 
Windows PC, but then copied them to a text file as well, so their encoding 
is dubious. You will have to make sure to use accented characters for UTF-8 
in your environment. And... make sure that you are using an editor that is 
not mangling the encodings.


I just tried a test where I copied the text from your email response and 
added the XML header line you used, and it posted fine to Solr, but I am 
running Solr 4.3. I used vi under Cygwin for the editing.


-- Jack Krupansky

-Original Message- 
From: jignesh

Sent: Tuesday, May 28, 2013 11:30 AM
To: solr-user@lucene.apache.org
Subject: Re: Not able to search Spanish word with ascent in solr

Hello Jack

Thanks for your reply..

I have tried to add below contents to solr, as you suggest

-

 
   doc-1
   Hola Mañana en le Café, habla el Académie
française!
 

--

BUT I am getting below error
--
I:\Program
Files\EasyPHP-5.3.9\www\solr\apache-solr-3.6.2\example\exampledocs>java -jar
post.jar hd.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file hd.xml
SimplePostTool: FATAL: Solr returned an error #400 Invalid UTF-8 middle byte
0x6e (at char #916, byte #-1)
--

I have also added
--

---

And then submit doc to solr

Please guide





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Not-able-to-search-Spanish-word-with-ascent-in-solr-tp4064404p4066537.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Note on The Book

2013-05-28 Thread Alexandre Rafalovitch
Jack,

It is worth considering something like https://leanpub.com/ . That way
people can pre-pay for the result and enjoy (however 'draft'-y)
results earlier.

In terms of reference vs narrative, my strong desire would have been
for the narrative part. The problem always seems to be around
understanding how the pieces/flow fit together and - only then - what
specific parameters have what syntax.

For printed books, I'd probably go for a ring binder for basic version
and maybe combined hard-cover for premium one. The premium one would
be the one you get office to buy or as a present. :-)

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Thu, May 23, 2013 at 7:14 PM, Jack Krupansky  wrote:
> To those of you who may have heard about the Lucene/Solr book that I and two
> others are writing on Lucene and Solr, some bad and good news. The bad news:
> The book contract with O'Reilly has been canceled. The good news: I'm going
> to proceed with self-publishing (possibly on Lulu or even Amazon) a somewhat
> reduced scope Solr-only Reference Guide (with hints of Lucene). The scope of
> the previous effort was too great, even for O'Reilly - a book larger than 800
> pages (or even 600) that was heavy on reference and lighter on "guide" just
> wasn't fitting in with their traditional "guide" model. In truth, Solr is
> just too complex for a simple guide that covers it all, let alone Lucene as
> well.
>
> I'll announce more details in the coming weeks, but I expect to publish an
> e-book-only version of the book, focused on Solr reference (and plenty of
> guide as well), possibly on Lulu, plus eventually publish 4-8 individual
> print volumes for people who really want the paper. One model I may pursue is
> to offer the current, incomplete, raw, rough, draft as a $7.99 e-book, with
> the promise of updates every two weeks or a month as new and revised content
> and new releases of Solr become available. Maybe the individual e-book
> volumes would be $2 or $3. These are just preliminary ideas. Feel free to let
> me know what seems reasonable or excessive.
>
> For paper: Do people really want perfect bound, or would you prefer spiral
> bound that lies flat and folds back easily? I suppose we could offer both -
> which should be considered "premium"?
>
> I'll announce more details next week. The immediate goal will be to get the
> "raw rough draft" available to everyone ASAP.
>
> For those of you who have been early reviewers - your effort will not have
> been in vain. I have all your comments and will address them over the next
> month or two or three.
>
> Just for some clarity, the existing Solr Wiki and even the recent
> contribution of the LucidWorks Solr Reference to Apache really are still
> great contributions to general knowledge about Solr, but the book is intended
> to go much deeper into detail, especially with loads of examples and a lot
> more narrative guide. For example, the book has a complete list of the
> analyzer filters, each with a clean one-liner description. Ditto for every
> parameter (although I would note that the LucidWorks Solr Reference does a
> decent job of that as well.) Maybe, eventually, everything in the book COULD
> (and will) be integrated into the standard Solr doc, but until then, a
> single, integrated reference really is sorely needed. And, the book has a lot
> of narrative guide and walking through examples as well. Over time, I'm sure
> both will evolve. And just to be clear, the book is not a simple repurposing
> of the Solr wiki content - EVERY description of everything has been written
> fresh, from scratch. So, for example, analyzer filters get both short
> one-liner summary descriptions as well as more detailed descriptions, plus
> formal attribute specifications and numerous examples, including sample input
> and outputs (the LucidWorks Solr Reference does a better job with examples as
> well.)
>
> The book has been written in parallel with branch_4x and that will continue.
>
> -- Jack Krupansky

Query syntax error: Cannot parse ....

2013-05-28 Thread yriveiro
Hi,

When I try to run this query,
http://localhost:8983/solr/coreA/select?q=source_id:(7D1FFB# OR 7D1FFB) city:ES,
I get the error below:



400
1

org.apache.solr.search.SyntaxError: Cannot parse 'source_id:(7D1FFB':
Encountered "<EOF>" at line 1, column 43. Was expecting one of: ... "+" ...
"-" ... "(" ... ")" ... "*" ... "^" ... "[" ... "{" ...

400



How I can fix the query?

Regards

/Yago



-
Best regards
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-syntax-error-Cannot-parse-tp4066560.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Composite Unique key from existing fields in schema

2013-05-28 Thread Rishi Easwaran
Thanks Jack, looks like that will do the trick for me. I will try it out. 

 

 

 

-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 12:07 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


You can do this by combining the builtin update processors.

Add this to your solrconfig:


  
docid_s
userid_s
id
  
  
id
--
  
  
  


Add documents such as:

curl 
"http://localhost:8983/solr/update?commit=true&update.chain=composite-id"; \
-H 'Content-type:application/json' -d '
[{"title": "Hello World",
  "docid_s": "doc-1",
  "userid_s": "user-1",
  "comments_ss": ["Easy", "Fast"]}]'

And get results like:

"title":["Hello World"],
"docid_s":"doc-1",
"userid_s":"user-1",
"comments_ss":["Easy",
  "Fast"],
"id":"doc-1--user-1",

Add as many fields in whatever order you want using "source" in the clone 
update processor, and pick your composite key field name as well. And set 
the delimiter string as well in the concat update processor.

I managed to reverse the field order from what you requested (userid, 
docid).

I used the standard Solr example schema, so I used dynamic fields for the 
two ids, but use your own field names.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 11:12 AM
To: solr-user@lucene.apache.org
Subject: Solr Composite Unique key from existing fields in schema

Hi All,

Historically we have used a single field in our schema as a uniqueKey.

  
  
docid

Wanted to change this to a composite key something like 
userid-docid.
I know I can auto generate compositekey at document insert time, using 
custom code to generate a new field, but wanted to know if there was an 
inbuilt SOLR mechanism of doing this. That would prevent us from creating 
and storing an extra field.

Thanks,

Rishi.





 


RE: Note on The Book

2013-05-28 Thread Swati Swoboda
I'd definitely prefer the spiral bound as well. E-books are great and your 
draft version seems very reasonably priced (aka I would definitely get it). 

Really looking forward to this. Is there a separate mailing list / etc. for the 
book for those who would like to receive updates on the status of the book?

Thanks 

Swati Swoboda 
Software Developer - Igloo Software
+1.519.489.4120  sswob...@igloosoftware.com

Bring back Cake Fridays - watch a video you'll actually like
http://vimeo.com/64886237


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Thursday, May 23, 2013 7:15 PM
To: solr-user@lucene.apache.org
Subject: Note on The Book



Re: Query syntax error: Cannot parse ....

2013-05-28 Thread gpssolr2020
Hi,

Try to pass the URL-encoded value (%23) for #.
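For example (the field name is just an illustration) -- a raw "#" in a URL starts
the fragment, so everything after it never reaches Solr:

/select?q=code:ABC%23123     instead of     /select?q=code:ABC#123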


Thanks.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-syntax-error-Cannot-parse-tp4066560p4066566.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Composite Unique key from existing fields in schema

2013-05-28 Thread Rishi Easwaran
Jack,

Not sure if this is the correct behaviour.
I set up the updateRequestProcessor chain as mentioned below, but it looks like the 
compositeId that is generated is based on input order.

For example: 
If my input comes in as 
1
12345

 I get the following compositeId 1-12345. 

If I reverse the input 

12345

1
I get the following compositeId 12345-1. 
 

In this case the compositeId is not unique and I am getting duplicates.

Thanks,

Rishi.



-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 12:07 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


You can do this by combining the builtin update processors.

Add this to your solrconfig:


  
docid_s
userid_s
id
  
  
id
--
  
  
  


Add documents such as:

curl 
"http://localhost:8983/solr/update?commit=true&update.chain=composite-id"; \
-H 'Content-type:application/json' -d '
[{"title": "Hello World",
  "docid_s": "doc-1",
  "userid_s": "user-1",
  "comments_ss": ["Easy", "Fast"]}]'

And get results like:

"title":["Hello World"],
"docid_s":"doc-1",
"userid_s":"user-1",
"comments_ss":["Easy",
  "Fast"],
"id":"doc-1--user-1",

Add as many fields in whatever order you want using "source" in the clone 
update processor, and pick your composite key field name as well. And set 
the delimiter string as well in the concat update processor.

I managed to reverse the field order from what you requested (userid, 
docid).

I used the standard Solr example schema, so I used dynamic fields for the 
two ids, but use your own field names.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 11:12 AM
To: solr-user@lucene.apache.org
Subject: Solr Composite Unique key from existing fields in schema

Hi All,

Historically we have used a single field in our schema as a uniqueKey.

  
  
docid

Wanted to change this to a composite key something like 
userid-docid.
I know I can auto generate compositekey at document insert time, using 
custom code to generate a new field, but wanted to know if there was an 
inbuilt SOLR mechanism of doing this. That would prevent us from creating 
and storing an extra field.

Thanks,

Rishi.





 


RE: Keeping a rolling window of indexes around solr

2013-05-28 Thread Saikat Kanjilal
At first glance, unless I missed something, Hourglass will definitely not work 
for our use case, which just involves real-time inserts of new log data and no 
appends at all.  However, I would like to examine the guts of Hourglass to see 
if we can customize it for our use case.

> From: arafa...@gmail.com
> Date: Mon, 27 May 2013 16:17:12 -0400
> Subject: Re: Keeping a rolling window of indexes around solr
> To: solr-user@lucene.apache.org
> 
> But how is Hourglass going to help Solr? Or is it a portable implementation?
> 
> Regards,
>Alex.
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
> 
> 
> On Mon, May 27, 2013 at 3:48 PM, Otis Gospodnetic
>  wrote:
> > Hi,
> >
> > SolrCloud now has the same index aliasing as Elasticsearch.  I can't lookup
> > the link now but Zoie from LinkedIn has Hourglass, which is uses for
> > circular buffer sort of index setup if I recall correctly.
> >
> > Otis
> > Solr & ElasticSearch Support
> > http://sematext.com/
> > On May 24, 2013 10:26 AM, "Saikat Kanjilal"  wrote:
> >
> >> Hello Solr community folks,
> >> I am doing some investigative work around how to roll and manage indexes
> >> inside our solr configuration, to date I've come up with an architecture
> >> that separates a set of masters that are focused on writes and get
> >> replicated periodically and a set of slave shards strictly docused on
> >> reads, additionally for each master index the design contains partial
> >> purges which get performed on each of the slave shards as well as the
> >> master to keep the data current.   However the architecture seems a bit
> >> more complex than I'd like with a lot of moving pieces.  I was wondering if
> >> anyone has ever handled/designed an architecture around a "conveyor belt"
> >> or rolling window of indexes around n days of data and if there are best
> >> practices around this.  One thing I was thinking about was whether to keep
> >> a conveyor belt list of the slave shards and rotate them as needed and drop
> >> the master periodically and make its backup temporarily the master.
> >>
> >>
> >> Anyways would love to hear thoughts and usecases that are similar from the
> >> community.
> >>
> >> Regards
  

Why do FQs make my spelling suggestions so slow?

2013-05-28 Thread Andy Lester
I'm working on using spellcheck for giving suggestions, and collations
are giving me good results, but they turn out to be very slow if
my original query has any FQs in it.  We can do 100 maxCollationTries
in no time at all, but if there are FQs in the query, things get
very slow.  As maxCollationTries and the count of FQs increase,
things get very slow very quickly.

          1     10     20     50    100   MaxCollationTries
0FQs      8      9     10     11     10
1FQ      11    160    599   1597   1668
2FQs     20    346   1163   3360   3361
3FQs     29    474   1852   5039   5095
4FQs     36    589   2463   6797   6807

All times are QTimes of ms.

See that top row?  With no FQs, 50 MaxCollationTries comes back
instantly.  Add just one FQ, though, and things go bad, and they
get worse as I add more of the FQs.  Also note that things seem to
level off at 100 MaxCollationTries.

Here's a query that I've been using as a test:

df=title_tracings_t&
fl=flrid,nodeid,title_tracings_t&
q=bagdad+AND+diaries+AND+-parent_tracings:(bagdad+AND+diaries)&
spellcheck.q=bagdad+AND+diaries&
rows=4&
wt=xml&
sort=popular_score+desc,+grouping+asc,+copyrightyear+desc,+flrid+asc&
spellcheck=true&
spellcheck.dictionary=direct&
spellcheck.onlyMorePopular=false&
spellcheck.count=15&
spellcheck.extendedResults=false&
spellcheck.collate=true&
spellcheck.maxCollations=10&
spellcheck.maxCollationTries=50&
spellcheck.collateExtendedResults=true&
spellcheck.alternativeTermCount=5&
spellcheck.maxResultsForSuggest=10&
debugQuery=off&
fq=((grouping:"1"+OR+grouping:"2"+OR+grouping:"3")+OR+solrtype:"N")&
fq=((item_source:"F"+OR+item_source:"B"+OR+item_source:"M")+OR+solrtype:"N")&
fq={!tag%3Dgrouping}((grouping:"1"+OR+grouping:"2")+OR+solrtype:"N")&
fq={!tag%3Dlanguagecode}(languagecode:"eng"+OR+solrtype:"N")&

The only thing that changes between tests is the value of
spellcheck.maxCollationTries and how many FQs are at the end.

Am I doing something wrong?  Do the collation internals not handle
FQs correctly?  The lookup/hit counts on filterCache seem to be
increasing just fine.  It will do N lookups, N hits, so I'm not
thinking that caching is the problem.

We'd really like to be able to use the spellchecker but the results
with only 10-20 maxCollationTries aren't nearly as good as if we
can bump that up to 100, but we can't afford the slow response time.
We also can't do without the FQs.

Thanks,
Andy


--
Andy Lester => a...@petdance.com => www.petdance.com => AIM:petdance



Re: Solr Composite Unique key from existing fields in schema

2013-05-28 Thread Jack Krupansky
The order in the ID should be purely dependent on the order of the field 
names in the processor configuration:


docid_s
userid_s

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran

Sent: Tuesday, May 28, 2013 2:54 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Composite Unique key from existing fields in schema

Jack,

No sure if this is the correct behaviour.
I set up updateRequestorPorcess chain as mentioned below, but looks like the 
compositeId that is generated is based on input order.


For example:
If my input comes in as
1
12345

I get the following compositeId1-12345.

If I reverse the input

12345

1
I get the following compositeId 12345-1 .


In this case the compositeId is not unique and I am getting duplicates.

Thanks,

Rishi.



-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 12:07 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


You can do this by combining the builtin update processors.

Add this to your solrconfig:


 
   docid_s
   userid_s
   id
 
 
   id
   --
 
 
 


Add documents such as:

curl
"http://localhost:8983/solr/update?commit=true&update.chain=composite-id"; \
-H 'Content-type:application/json' -d '
[{"title": "Hello World",
 "docid_s": "doc-1",
 "userid_s": "user-1",
 "comments_ss": ["Easy", "Fast"]}]'

And get results like:

"title":["Hello World"],
"docid_s":"doc-1",
"userid_s":"user-1",
"comments_ss":["Easy",
 "Fast"],
"id":"doc-1--user-1",

Add as many fields in whatever order you want using "source" in the clone
update processor, and pick your composite key field name as well. And set
the delimiter string as well in the concat update processor.

I managed to reverse the field order from what you requested (userid,
docid).

I used the standard Solr example schema, so I used dynamic fields for the
two ids, but use your own field names.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran

Sent: Tuesday, May 28, 2013 11:12 AM
To: solr-user@lucene.apache.org
Subject: Solr Composite Unique key from existing fields in schema

Hi All,

Historically we have used a single field in our schema as a uniqueKey.

 
 
docid

Wanted to change this to a composite key something like
userid-docid.
I know I can auto generate compositekey at document insert time, using
custom code to generate a new field, but wanted to know if there was an
inbuilt SOLR mechanism of doing this. That would prevent us from creating
and storing an extra field.

Thanks,

Rishi.








Re: Nested Facets and distributed shard system.

2013-05-28 Thread vibhoreng04
Hi Erick  and Markus,

Any Idea on this ? can we resolve this by group by queries?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nested-Facets-and-distributed-shard-system-tp4065847p4066583.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Composite Unique key from existing fields in schema

2013-05-28 Thread Rishi Easwaran
I thought the same, but that doesn't seem to be the case.


 

 

 

-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 3:32 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


The order in the ID should be purely dependent on the order of the field 
names in the processor configuration:

docid_s
userid_s

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 2:54 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Composite Unique key from existing fields in schema

Jack,

No sure if this is the correct behaviour.
I set up updateRequestorPorcess chain as mentioned below, but looks like the 
compositeId that is generated is based on input order.

For example:
If my input comes in as
1
12345

I get the following compositeId1-12345.

If I reverse the input

12345

1
I get the following compositeId 12345-1 .


In this case the compositeId is not unique and I am getting duplicates.

Thanks,

Rishi.



-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 12:07 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


You can do this by combining the builtin update processors.

Add this to your solrconfig:


  
docid_s
userid_s
id
  
  
id
--
  
  
  


Add documents such as:

curl
"http://localhost:8983/solr/update?commit=true&update.chain=composite-id"; \
-H 'Content-type:application/json' -d '
[{"title": "Hello World",
  "docid_s": "doc-1",
  "userid_s": "user-1",
  "comments_ss": ["Easy", "Fast"]}]'

And get results like:

"title":["Hello World"],
"docid_s":"doc-1",
"userid_s":"user-1",
"comments_ss":["Easy",
  "Fast"],
"id":"doc-1--user-1",

Add as many fields in whatever order you want using "source" in the clone
update processor, and pick your composite key field name as well. And set
the delimiter string as well in the concat update processor.

I managed to reverse the field order from what you requested (userid,
docid).

I used the standard Solr example schema, so I used dynamic fields for the
two ids, but use your own field names.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 11:12 AM
To: solr-user@lucene.apache.org
Subject: Solr Composite Unique key from existing fields in schema

Hi All,

Historically we have used a single field in our schema as a uniqueKey.

  
  
docid

Wanted to change this to a composite key something like
userid-docid.
I know I can auto generate compositekey at document insert time, using
custom code to generate a new field, but wanted to know if there was an
inbuilt SOLR mechanism of doing this. That would prevent us from creating
and storing an extra field.

Thanks,

Rishi.







 


SOLR 4.3.0 - How to make fq optional?

2013-05-28 Thread bbarani
I am using the SOLR geospatial capabilities for filtering the results based
on the particular radius (something like below).. I have added the below fq
query in solrconfig and passing the latitude and longitude information
dynamically..

select?q=firstName:john&fq={!bbox%20sfield=geo%20pt=40.279392,-81.85891723%20d=10}


_query_:"{firstName=$firstName}"



_query_:"{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}"


Now when I pass the latitude and longitude data, the query works fine, but
whenever I don't pass the latitude / longitude data it throws an exception. Is
there a way to make the fq optional? Is there a way to ignore spatial queries
when the coordinates are not passed? Looking for something like dismax,
which doesn't throw any exceptions...





--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-4-3-0-How-to-make-fq-optional-tp4066592.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Composite Unique key from existing fields in schema

2013-05-28 Thread Jack Krupansky

The TL;DR response: Try this:


 
   userid_s
   id
 
 
   docid_s
   id
 
 
   id
   --
 
 
 


That will assure that the userid gets processed before the docid.
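Written out with the XML wrappers, that would look roughly like the following (a 
sketch using the stock solr.CloneFieldUpdateProcessorFactory and 
solr.ConcatFieldUpdateProcessorFactory; the chain name and the trailing log/run 
processors are just the usual boilerplate):

<updateRequestProcessorChain name="composite-id">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">userid_s</str>
    <str name="dest">id</str>
  </processor>
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">docid_s</str>
    <str name="dest">id</str>
  </processor>
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">id</str>
    <str name="delimiter">--</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>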

I'll have to review the contract for CloneFieldUpdateProcessorFactory to see 
what is or ain't guaranteed when there are multiple input fields - whether 
this is a bug or a feature or simply undefined.


-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran

Sent: Tuesday, May 28, 2013 3:54 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Composite Unique key from existing fields in schema

I thought the same, but that doesn't seem to be the case.








-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 3:32 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


The order in the ID should be purely dependent on the order of the field
names in the processor configuration:

docid_s
userid_s

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran

Sent: Tuesday, May 28, 2013 2:54 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Composite Unique key from existing fields in schema

Jack,

No sure if this is the correct behaviour.
I set up updateRequestorPorcess chain as mentioned below, but looks like the
compositeId that is generated is based on input order.

For example:
If my input comes in as
1
12345

I get the following compositeId1-12345.

If I reverse the input

12345

1
I get the following compositeId 12345-1 .


In this case the compositeId is not unique and I am getting duplicates.

Thanks,

Rishi.



-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 12:07 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


You can do this by combining the builtin update processors.

Add this to your solrconfig:


 
   docid_s
   userid_s
   id
 
 
   id
   --
 
 
 


Add documents such as:

curl
"http://localhost:8983/solr/update?commit=true&update.chain=composite-id"; \
-H 'Content-type:application/json' -d '
[{"title": "Hello World",
 "docid_s": "doc-1",
 "userid_s": "user-1",
 "comments_ss": ["Easy", "Fast"]}]'

And get results like:

"title":["Hello World"],
"docid_s":"doc-1",
"userid_s":"user-1",
"comments_ss":["Easy",
 "Fast"],
"id":"doc-1--user-1",

Add as many fields in whatever order you want using "source" in the clone
update processor, and pick your composite key field name as well. And set
the delimiter string as well in the concat update processor.

I managed to reverse the field order from what you requested (userid,
docid).

I used the standard Solr example schema, so I used dynamic fields for the
two ids, but use your own field names.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran

Sent: Tuesday, May 28, 2013 11:12 AM
To: solr-user@lucene.apache.org
Subject: Solr Composite Unique key from existing fields in schema

Hi All,

Historically we have used a single field in our schema as a uniqueKey.

 
 
docid

Wanted to change this to a composite key something like
userid-docid.
I know I can auto generate compositekey at document insert time, using
custom code to generate a new field, but wanted to know if there was an
inbuilt SOLR mechanism of doing this. That would prevent us from creating
and storing an extra field.

Thanks,

Rishi.










RE: Why do FQs make my spelling suggestions so slow?

2013-05-28 Thread Dyer, James
Andy,

What are the QTimes for the 0fq, 1fq, 2fq, 3fq & 4fq cases with spellcheck 
entirely turned off?  Is it about (or a little more than) half the total when 
maxCollationTries=1 ?  Also, with the varying # of fq's, how many collation 
tries does it take to get 10 collations?

Possibly, a better way to test this is to set maxCollations = 
maxCollationTries.  The reason is that it quits "trying" once it finds 
"maxCollations", so if with 0fq's, lots of combinations can generate hits and 
it doesn't need to try very many to get to 10.  But with more fq's, fewer 
collations will pan out so now it is trying more up to 100 before (if ever) it 
gets to 10.
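That is, for the test, set something like (the number is just for illustration):

spellcheck.maxCollations=100&spellcheck.maxCollationTries=100

so every run performs a comparable number of tries regardless of how many of them 
pan out.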

I would predict that for each "try" it has to do (and you can force this by 
setting maxCollations = maxCollationTries), qtime will grow linearly per try.  
(I'm assuming you have all non-search components like faceting turned off).  So 
say with 2fq's it takes 10ms for the query to complete with spellcheck off, and 
20ms with "maxCollation = maxCollationTries = 1", then it will take about 110ms 
with "maxCollation = maxCollationTries = 10".

Now if you are finding that with a certain # of fq's, qtime with spellcheck off 
is, for instance, 2ms, 1 try is 10ms, 2 tries is 19ms, etc, then this is more 
than linear growth.  In this case we would need to look at how spell check 
applies fq's and see if there is a bug with it using the cache correctly.

But I think you're just setting maxCollationTries too high.  You're asking it 
to do too much work in trying tens of combinations.  Really, this feature was 
designed to spellcheck and not suggest.  But see 
https://issues.apache.org/jira/browse/SOLR-3240 , which is committed to the 4x 
branch for inclusion in an eventual 4.4 release.  This will make the time to do 
collation tries grow less than linearly, possibly making it more suitable for 
suggest.

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: Andy Lester [mailto:a...@petdance.com] 
Sent: Tuesday, May 28, 2013 2:29 PM
To: solr-user@lucene.apache.org
Subject: Why do FQs make my spelling suggestions so slow?

I'm working on using spellcheck for giving suggestions, and collations
are giving me good results, but they turn out to be very slow if
my original query has any FQs in it.  We can do 100 maxCollationTries
in no time at all, but if there are FQs in the query, things get
very slow.  As maxCollationTries and the count of FQs increase,
things get very slow very quickly.

          1     10     20     50    100   MaxCollationTries
0FQs      8      9     10     11     10
1FQ      11    160    599   1597   1668
2FQs     20    346   1163   3360   3361
3FQs     29    474   1852   5039   5095
4FQs     36    589   2463   6797   6807

All times are QTimes of ms.

See that top row?  With no FQs, 50 MaxCollationTries comes back
instantly.  Add just one FQ, though, and things go bad, and they
get worse as I add more of the FQs.  Also note that things seem to
level off at 100 MaxCollationTries.

Here's a query that I've been using as a test:

df=title_tracings_t&
fl=flrid,nodeid,title_tracings_t&
q=bagdad+AND+diaries+AND+-parent_tracings:(bagdad+AND+diaries)&
spellcheck.q=bagdad+AND+diaries&
rows=4&
wt=xml&
sort=popular_score+desc,+grouping+asc,+copyrightyear+desc,+flrid+asc&
spellcheck=true&
spellcheck.dictionary=direct&
spellcheck.onlyMorePopular=false&
spellcheck.count=15&
spellcheck.extendedResults=false&
spellcheck.collate=true&
spellcheck.maxCollations=10&
spellcheck.maxCollationTries=50&
spellcheck.collateExtendedResults=true&
spellcheck.alternativeTermCount=5&
spellcheck.maxResultsForSuggest=10&
debugQuery=off&
fq=((grouping:"1"+OR+grouping:"2"+OR+grouping:"3")+OR+solrtype:"N")&
fq=((item_source:"F"+OR+item_source:"B"+OR+item_source:"M")+OR+solrtype:"N")&
fq={!tag%3Dgrouping}((grouping:"1"+OR+grouping:"2")+OR+solrtype:"N")&
fq={!tag%3Dlanguagecode}(languagecode:"eng"+OR+solrtype:"N")&

The only thing that changes between tests is the value of
spellcheck.maxCollationTries and how many FQs are at the end.

Am I doing something wrong?  Do the collation internals not handle
FQs correctly?  The lookup/hit counts on filterCache seem to be
increasing just fine.  It will do N lookups, N hits, so I'm not
thinking that caching is the problem.

We'd really like to be able to use the spellchecker but the results
with only 10-20 maxCollationTries aren't nearly as good as if we
can bump that up to 100, but we can't afford the slow response time.
We also can't do without the FQs.

Thanks,
Andy


--
Andy Lester => a...@petdance.com => www.petdance.com => AIM:petdance





Re: SOLR 4.3.0 - How to make fq optional?

2013-05-28 Thread David Smiley (@MITRE.org)
Your client needs to know to submit the proper filter query conditionally. 
It's not really a spatial issue, and I disagree with the idea to make bbox
(and all other query parsers for that matter) do nothing if not given an
expected input.

~ David


bbarani wrote
> I am using the SOLR geospatial capabilities for filtering the results
> based on the particular radius (something like below).. I have added the
> below fq query in solrconfig and passing the latitude and longitude
> information dynamically but I am hardcoding the dynamic query in
> solrconfig.xml..
> 
> select?q=firstName:john&fq={!bbox%20sfield=geo%20pt=40.279392,-81.85891723%20d=10}
> 
> 
> 
> _query_:"{firstName=$firstName}"
> 
> 
> 
> 
> _query_:"{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}"
> 
> 
> Now when I pass the latitude and longitude data, the query works fine but
> whenever I dont pass the latitude / longitude data it throws exception..
> Is there a way to make fq optional? Is there a way to ignore spatial
> queries when the co ordinates are not passed? Looking for something like
> dismax, that doesnt throw any exceptions...





-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-4-3-0-How-to-make-fq-optional-tp4066592p4066604.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Note on The Book

2013-05-28 Thread Jack Krupansky
We'll have a blog for the book. We hope to have a first 
raw/rough/partial/draft published as an e-book in maybe 10 days to 2 weeks. 
As soon as we get that process under control, we'll start the blog. I'll 
keep your email on file and keep you posted.


-- Jack Krupansky

-Original Message- 
From: Swati Swoboda

Sent: Tuesday, May 28, 2013 1:36 PM
To: solr-user@lucene.apache.org
Subject: RE: Note on The Book

I'd definitely prefer the spiral bound as well. E-books are great and your 
draft version seems very reasonably priced (aka I would definitely get it).


Really looking forward to this. Is there a separate mailing list / etc. for 
the book for those who would like to receive updates on the status of the 
book?


Thanks

Swati Swoboda
Software Developer - Igloo Software
+1.519.489.4120  sswob...@igloosoftware.com

Bring back Cake Fridays – watch a video you’ll actually like
http://vimeo.com/64886237


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Thursday, May 23, 2013 7:15 PM
To: solr-user@lucene.apache.org
Subject: Note on The Book

To those of you who may have heard about the Lucene/Solr book that I and two 
others are writing on Lucene and Solr, some bad and good news. The bad news: 
The book contract with O’Reilly has been canceled. The good news: I’m going 
to proceed with self-publishing (possibly on Lulu or even Amazon) a somewhat 
reduced scope Solr-only Reference Guide (with hints of Lucene). The scope of 
the previous effort was too great, even for O’Reilly – a book larger than 
800 pages (or even 600) that was heavy on reference and lighter on “guide” 
just wasn’t fitting in with their traditional “guide” model. In truth, Solr 
is just too complex for a simple guide that covers it all, let alone Lucene 
as well.


I’ll announce more details in the coming weeks, but I expect to publish an 
e-book-only version of the book, focused on Solr reference (and plenty of 
guide as well), possibly on Lulu, plus eventually publish 4-8 individual 
print volumes for people who really want the paper. One model I may pursue 
is to offer the current, incomplete, raw, rough, draft as a $7.99 e-book, 
with the promise of updates every two weeks or a month as new and revised 
content and new releases of Solr become available. Maybe the individual 
e-book volumes would be $2 or $3. These are just preliminary ideas. Feel 
free to let me know what seems reasonable or excessive.


For paper: Do people really want perfect bound, or would you prefer spiral 
bound that lies flat and folds back easily? I suppose we could offer both – 
which should be considered “premium”?


I’ll announce more details next week. The immediate goal will be to get the 
“raw rough draft” available to everyone ASAP.


For those of you who have been early reviewers – your effort will not have 
been in vain. I have all your comments and will address them over the next 
month or two or three.


Just for some clarity, the existing Solr Wiki and even the recent 
contribution of the LucidWorks Solr Reference to Apache really are still 
great contributions to general knowledge about Solr, but the book is 
intended to go much deeper into detail, especially with loads of examples 
and a lot more narrative guide. For example, the book has a complete list of 
the analyzer filters, each with a clean one-liner description. Ditto for 
every parameter (although I would note that the LucidWorks Solr Reference 
does a decent job of that as well.) Maybe, eventually, everything in the 
book COULD (and will) be integrated into the standard Solr doc, but until 
then, a single, integrated reference really is sorely needed. And, the book 
has a lot of narrative guide and walking through examples as well. Over 
time, I’m sure both will evolve. And just to be clear, the book is not a 
simple repurposing of the Solr wiki content – EVERY description of 
everything has been written fresh, from scratch. So, for example, analyzer 
filters get both short one-liner summary descriptions as well as more 
detailed descriptions, plus formal attribute specifications and numerous 
examples, including sample input and outputs (the LucidWorks Solr Reference 
does a better job with examples as well.)


The book has been written in parallel with branch_4x and that will continue.

-- Jack Krupansky 



Re: SOLR 4.3.0 - How to make fq optional?

2013-05-28 Thread bbarani
David, I felt like there should be a flag with which we can either throw the
error message or do nothing in case of bad inputs.. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-4-3-0-How-to-make-fq-optional-tp4066592p4066610.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Why do FQs make my spelling suggestions so slow?

2013-05-28 Thread Andy Lester
Thanks for looking at this.

> What are the QTimes for the 0fq,1fq,2fq,4fq & 4fq cases with spellcheck 
> entirely turned off?  Is it about (or a little more than) half the total when 
> maxCollationTries=1 ?

With spellcheck off I get 8ms for 4fq query.


>  Also, with the varying # of fq's, how many collation tries does it take to 
> get 10 collations?

I don't know.  How can I tell?


> Possibly, a better way to test this is to set maxCollations = 
> maxCollationTries.  The reason is that it quits "trying" once it finds 
> "maxCollations", so if with 0fq's, lots of combinations can generate hits and 
> it doesn't need to try very many to get to 10.  But with more fq's, fewer 
> collations will pan out so now it is trying more up to 100 before (if ever) 
> it gets to 10.

It does just fine doing 100 collations so long as there are no FQs.  It seems 
to me that the FQs are taking an inordinate amount of extra time.  100 
collations in (roughly) the same amount of time as a single collation, so long 
as there are no FQs.  Why are the FQs such a drag on the collation process?


> (I'm assuming you have all non-search components like faceting turned off).

Yes, definitely.


>  So say with 2fq's it takes 10ms for the query to complete with spellcheck 
> off, and 20ms with "maxCollation = maxCollationTries = 1", then it will take 
> about 110ms with "maxCollation = maxCollationTries = 10".

I can do maxCollation = maxCollationTries = 100 and it comes back in 14ms, so 
long as I have FQs off.  Add a single FQ and it becomes 13499ms.

I can do maxCollation = maxCollationTries = 1000 and it comes back in 45ms, so 
long as I have FQs off.  Add a single FQ and it becomes 62038ms.


> But I think you're just setting maxCollationTries too high.  You're asking it 
> to do too much work in trying teens of combinations.

The results I get back with 100 tries are about twice as many as I get with 10 
tries.  That's a big difference to the user when it's trying to figure out 
misspelled phrases.

Andy

--
Andy Lester => a...@petdance.com => www.petdance.com => AIM:petdance



Re: SOLR 4.3.0 - How to make fq optional?

2013-05-28 Thread Erik Hatcher
I imagine the new "switch" query parser could help here somehow.  

Erik

On May 28, 2013, at 16:43, "David Smiley (@MITRE.org)"  
wrote:

> Your client needs to know to submit the proper filter query conditionally. 
> It's not really a spatial issue, and I disagree with the idea to make bbox
> (and all other query parsers for that matter) do nothing if not given an
> expected input.
> 
> ~ David
> 
> 
> bbarani wrote
>> I am using the SOLR geospatial capabilities for filtering the results
>> based on the particular radius (something like below).. I have added the
>> below fq query in solrconfig and passing the latitude and longitude
>> information dynamically but I am hardcoding the dynamic query in
>> solrconfig.xml..
>> 
>> select?q=firstName:john&fq={!bbox%20sfield=geo%20pt=40.279392,-81.85891723%20d=10}
>> 
>> 
>> 
>>_query_:"{firstName=$firstName}"
>> 
>> 
>> 
>> 
>>_query_:"{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}"
>> 
>> 
>> Now when I pass the latitude and longitude data, the query works fine but
>> whenever I dont pass the latitude / longitude data it throws exception..
>> Is there a way to make fq optional? Is there a way to ignore spatial
>> queries when the co ordinates are not passed? Looking for something like
>> dismax, that doesnt throw any exceptions...
> 
> 
> 
> 
> 
> -
> Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SOLR-4-3-0-How-to-make-fq-optional-tp4066592p4066604.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Nested Facets and distributed shard system.

2013-05-28 Thread Jason Hellman
You have mentioned Pivot Facets, but have you looked at the Path Hierarchy 
Tokenizer Factory:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PathHierarchyTokenizerFactory

This matches your use case, as best as I understand it.
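A schema sketch to show the idea (the field and type names are just placeholders):

<fieldType name="descendent_path" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="category" type="descendent_path" indexed="true" stored="true"/>

Indexing a value like "electronics/cameras/slr" produces the tokens "electronics",
"electronics/cameras" and "electronics/cameras/slr", so you can facet or filter on
any level of the hierarchy.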

Jason


On May 28, 2013, at 12:47 PM, vibhoreng04  wrote:

> Hi Erick  and Markus,
> 
> Any Idea on this ? can we resolve this by group by queries?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Nested-Facets-and-distributed-shard-system-tp4065847p4066583.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: SOLR 4.3.0 - How to make fq optional?

2013-05-28 Thread bbarani
Erik,

I am trying to enable / disable a part of fq based on a particular value
passed from the query.

For example: if the query passes a value for the "where" parameter, then I would
like to enable this fq, else just ignore it.

select?where="New york,NY"

Enable only when where has some value. (I get the value passed in query
inside fq - using $where)

I need something like..

*if($where!=null){*

{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}
 
*}*

 Is it possible to achieve this using the switch query parser?

Thanks,
BB



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-4-3-0-How-to-make-fq-optional-tp4066592p4066624.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Composite Unique key from existing fields in schema

2013-05-28 Thread Rishi Easwaran
Thanks Jack, That fixed it and guarantees the order.

As far as I can tell SOLR cloud 4.2.1 needs a uniquekey defined in its schema, 
or I get an exception.
SolrCore Initialization Failures
 * testCloud2_shard1_replica1: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
QueryElevationComponent requires the schema to have a uniqueKeyField. 

Now that I have an autogenerated composite-id, it has to become a part of my 
schema as uniquekey for SOLR cloud to work. 
  
  
  
compositeId

Is there a way to avoid the compositeId field being defined in my schema.xml? I would 
like to avoid the overhead of storing this field in my index.

Thanks,

Rishi.


 

 

 

-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 4:33 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


The TL;DR response: Try this:


  
userid_s
id
  
  
docid_s
id
  
  
id
--
  
  
  


That will assure that the userid gets processed before the docid.

I'll have to review the contract for CloneFieldUpdateProcessorFactory to see 
what is or ain't guaranteed when there are multiple input fields - whether 
this is a bug or a feature or simply undefined.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 3:54 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Composite Unique key from existing fields in schema

I thought the same, but that doesn't seem to be the case.








-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 3:32 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


The order in the ID should be purely dependent on the order of the field
names in the processor configuration:

docid_s
userid_s

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 2:54 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Composite Unique key from existing fields in schema

Jack,

No sure if this is the correct behaviour.
I set up updateRequestorPorcess chain as mentioned below, but looks like the
compositeId that is generated is based on input order.

For example:
If my input comes in as
1
12345

I get the following compositeId1-12345.

If I reverse the input

12345

1
I get the following compositeId 12345-1 .


In this case the compositeId is not unique and I am getting duplicates.

Thanks,

Rishi.



-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 12:07 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


You can do this by combining the builtin update processors.

Add this to your solrconfig:


  
docid_s
userid_s
id
  
  
id
--
  
  
  


Add documents such as:

curl
"http://localhost:8983/solr/update?commit=true&update.chain=composite-id"; \
-H 'Content-type:application/json' -d '
[{"title": "Hello World",
  "docid_s": "doc-1",
  "userid_s": "user-1",
  "comments_ss": ["Easy", "Fast"]}]'

And get results like:

"title":["Hello World"],
"docid_s":"doc-1",
"userid_s":"user-1",
"comments_ss":["Easy",
  "Fast"],
"id":"doc-1--user-1",

Add as many fields in whatever order you want using "source" in the clone
update processor, and pick your composite key field name as well. And set
the delimiter string as well in the concat update processor.

I managed to reverse the field order from what you requested (userid,
docid).

I used the standard Solr example schema, so I used dynamic fields for the
two ids, but use your own field names.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 11:12 AM
To: solr-user@lucene.apache.org
Subject: Solr Composite Unique key from existing fields in schema

Hi All,

Historically we have used a single field in our schema as a uniqueKey.

  
  
docid

Wanted to change this to a composite key something like
userid-docid.
I know I can auto generate compositekey at document insert time, using
custom code to generate a new field, but wanted to know if there was an
inbuilt SOLR mechanism of doing this. That would prevent us from creating
and storing an extra field.

Thanks,

Rishi.









 


Re: SOLR 4.3.0 - How to make fq optional?

2013-05-28 Thread Chris Hostetter

: David, I felt like there should be a flag with which we can either throw the
: error message or do nothing in case of bad inputs.. 

As erik alluded to in his response, you should be able to configure an 
"appended" fq using the "switch" QParserPlugin to get something like what you are 
describing, by taking advantage of the "default" behavior.

I think you'd want something along the lines of...


  50 


  {!switch case='*:*' default=$fps_bbox v=$fps_latlong}


  {!bbox pt=$fps_latlong sfield=geo d=$fps_dist}
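Spelled out with the XML wrappers, that would look roughly like this (a sketch; these 
lst sections live inside your search requestHandler, and whether fps_bbox sits in 
"defaults" or "invariants" is your call):

<lst name="defaults">
  <str name="fps_dist">50</str>
</lst>
<lst name="appends">
  <str name="fq">{!switch case='*:*' default=$fps_bbox v=$fps_latlong}</str>
</lst>
<lst name="invariants">
  <str name="fps_bbox">{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}</str>
</lst>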


The result of something like that should be:

 * if a user specifies a non-blank fps_latlong param, the query will 
be filtered using a bbox based on that center and the specified fps_dist 
(or the default fps_dist if the user didn't specify one)
 * if the user does not specify an fps_latlong param, or the fps_latlong 
param is blank, then no effective filtering is done (the filter matches 
all docs)

Here's an equivalent example using the Solr 4.2.1 example data and 
configs...

fps_latlong specified, matches a single document in the radius..

http://localhost:8983/solr/select?q=*:*&fps_dist=100&fps_bbox={!bbox%20pt=$fps_latlong%20sfield=store%20d=$fps_dist}&fq={!switch%20case=%27*:*%27%20default=$fps_bbox%20v=$fps_latlong}&fps_latlong=35.0752,-97.032


fps_latlong not specified, or blank, matches all docs...

http://localhost:8983/solr/select?q=*:*&fps_dist=100&fps_bbox={!bbox%20pt=$fps_latlong%20sfield=store%20d=$fps_dist}&fq={!switch%20case=%27*:*%27%20default=$fps_bbox%20v=$fps_latlong}&fps_latlong=
http://localhost:8983/solr/select?q=*:*&fps_dist=100&fps_bbox={!bbox%20pt=$fps_latlong%20sfield=store%20d=$fps_dist}&fq={!switch%20case=%27*:*%27%20default=$fps_bbox%20v=$fps_latlong}&fps_latlong=+++
http://localhost:8983/solr/select?q=*:*&fps_dist=100&fps_bbox={!bbox%20pt=$fps_latlong%20sfield=store%20d=$fps_dist}&fq={!switch%20case=%27*:*%27%20default=$fps_bbox%20v=$fps_latlong}


More info...

https://lucene.apache.org/solr/4_3_0/solr-core/org/apache/solr/search/SwitchQParserPlugin.html
http://searchhub.org/2013/02/20/custom-solr-request-params/


-Hoss


Re: split document or not

2013-05-28 Thread Hard_Club
Thanks, Alexandre.

But I need to know which paragraph matched the request. I need it
because paragraphs are bound to some extra data that I need to output on the
result page. So I need to know the paragraph ids. How can I bind such an attribute
to a multivalued field?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/split-document-or-not-tp4066170p4066629.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: split document or not

2013-05-28 Thread Jason Hellman
You may wish to explore the concept of using the Result Grouping (Field 
Collapsing) feature in which your paragraphs are individual documents that 
share a field to group them by (the ID of the document/book/article/whatever).

http://wiki.apache.org/solr/FieldCollapsing

This will net you absolutely isolated results for paragraphs, and give you a 
great deal of flexibility on how to query the results in cases where you do or 
do not need them grouped.
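For example, something like (field names are just placeholders):

/select?q=body:sometext&group=true&group.field=parent_doc_id&group.limit=3&fl=paragraph_id,body

Each group then corresponds to one document/book/article, and every result inside a 
group is an individual paragraph, so the paragraph ids come back directly in the 
response.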

Jason


On May 28, 2013, at 3:10 PM, Hard_Club  wrote:

> Thanks, Alexandre.
> 
> But I need to know in which paragraph is matched the request. I need it
> because paragraphs are binded to some extra data that I need to output on
> result page. So I need to know paragraphs is'd. How to bind such attribute
> to multivalued field?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/split-document-or-not-tp4066170p4066629.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: SOLR 4.3.0 - How to make fq optional?

2013-05-28 Thread bbarani
Hoss, you read my mind. Thanks a lot for your awesome
explanation! You rock!!!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-4-3-0-How-to-make-fq-optional-tp4066592p4066630.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: [blog post] Automatically Acquiring Synonym Knowledge from Wikipedia

2013-05-28 Thread Koji Sekiguchi

Hi Rajesh,

Thanks!
I'm planning to open an NLP tool kit for Lucene, and the tool kit will include
the following synonym library.

koji

(13/05/28 14:12), Rajesh Nikam wrote:

Hello Koji,

This is seems pretty useful post on how to create synonyms file.
Thanks a lot for sharing this !

Have you shared source code / jar for the same so at it could be used ?

Thanks,
Rajesh



On Mon, May 27, 2013 at 8:44 PM, Koji Sekiguchi  wrote:


Hello,

Sorry for cross post. I just wanted to announce that I've written a blog
post on
how to create synonyms.txt file automatically from Wikipedia:


http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html

Hope that the article gives someone a good experience!

koji
--

http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html






--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html


Re: Solr Composite Unique key from existing fields in schema

2013-05-28 Thread Jack Krupansky
Great. And I did verify that the field order cannot be guaranteed by a 
single CloneFieldUpdateProcessorFactory with multiple field names - the 
underlying code iterates over the input values, checks the field selector 
for membership and then immediately adds to the output, so changing the 
input order will change the output order. Also, field names are stored in a 
HashSet anyway, which would tend to scramble their order.


-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran

Sent: Tuesday, May 28, 2013 6:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Composite Unique key from existing fields in schema

Thanks Jack, That fixed it and guarantees the order.

As far as I can tell SOLR cloud 4.2.1 needs a uniquekey defined in its 
schema, or I get an exception.

SolrCore Initialization Failures
* testCloud2_shard1_replica1: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
QueryElevationComponent requires the schema to have a uniqueKeyField.


Now that I have an autogenerated composite-id, it has to become a part of my 
schema as uniquekey for SOLR cloud to work.
 multiValued="false" required="true"/>
 multiValued="false" required="true"/>
multiValued="false" required="true"/>

compositeId

Is there a way to avoid compositeId field being defined in my schema.xml, 
would like to avoid the overhead of storing this field in my index.


Thanks,

Rishi.








-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 4:33 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


The TL;DR response: Try this:


 
   userid_s
   id
 
 
   docid_s
   id
 
 
   id
   --
 
 
 


That will assure that the userid gets processed before the docid.

I'll have to review the contract for CloneFieldUpdateProcessorFactory to see
what is or ain't guaranteed when there are multiple input fields - whether
this is a bug or a feature or simply undefined.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran

Sent: Tuesday, May 28, 2013 3:54 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Composite Unique key from existing fields in schema

I thought the same, but that doesn't seem to be the case.








-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 3:32 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


The order in the ID should be purely dependent on the order of the field
names in the processor configuration:

docid_s
userid_s

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran

Sent: Tuesday, May 28, 2013 2:54 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Composite Unique key from existing fields in schema

Jack,

No sure if this is the correct behaviour.
I set up updateRequestorPorcess chain as mentioned below, but looks like the
compositeId that is generated is based on input order.

For example:
If my input comes in as
1
12345

I get the following compositeId1-12345.

If I reverse the input

12345

1
I get the following compositeId 12345-1 .


In this case the compositeId is not unique and I am getting duplicates.

Thanks,

Rishi.



-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Tue, May 28, 2013 12:07 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


You can do this by combining the builtin update processors.

Add this to your solrconfig:


 
   docid_s
   userid_s
   id
 
 
   id
   --
 
 
 


Add documents such as:

curl
"http://localhost:8983/solr/update?commit=true&update.chain=composite-id"; \
-H 'Content-type:application/json' -d '
[{"title": "Hello World",
 "docid_s": "doc-1",
 "userid_s": "user-1",
 "comments_ss": ["Easy", "Fast"]}]'

And get results like:

"title":["Hello World"],
"docid_s":"doc-1",
"userid_s":"user-1",
"comments_ss":["Easy",
 "Fast"],
"id":"doc-1--user-1",

Add as many fields in whatever order you want using "source" in the clone
update processor, and pick your composite key field name as well. And set
the delimiter string as well in the concat update processor.

I managed to reverse the field order from what you requested (userid,
docid).

I used the standard Solr example schema, so I used dynamic fields for the
two ids, but use your own field names.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran

Sent: Tuesday, May 28, 2013 11:12 AM
To: solr-user@lucene.apache.org
Subject: Solr Composite Unique key from existing fields in schema

Hi All,

Historically we have used a single field in our schema as a uniqueKey.

 
 
docid

Wanted to change this to a composite key something like
userid-docid.
I know I can auto generate compositekey at document insert time, using
custom code to generate a new field, but wanted to know if there was an
inbuilt SOLR mechanism of doing this. That would prevent us from creating
and storing an extra field.

Thanks,

Rishi.












Re: Keeping a rolling window of indexes around solr

2013-05-28 Thread Chris Hostetter

: This is kind of the approach used by elastic search , if I'm not using 
: solrcloud will I be able to use shard aliasing, also with this approach 
: how would replication work, is it even needed?

you haven't said much about the volume of data you expect to deal with, 
nor have you really explained what types of queries you intend to do -- 
ie: you said you were interested in a "rolling window of indexes
around n days of data" but you never clarified why you think a 
rolling window of indexes would be useful to you or how exactly you would 
use it.

The primary advantage of sharding by date is if you know that a large 
percentage of your queries are only going to be within a small range of 
time, and therefore you can optimize those requests to only hit the shards 
necessary to satisfy that small window of time.

if the majority of requests are going to be across your entire "n days" of 
data, then date based sharding doesn't really help you -- you can just use 
arbitrary (randomized) sharding, with periodic deleteByQuery commands to 
purge anything older than N days.  Query the whole collection by default, 
and add a filter query if/when you want to restrict your search to only a 
narrow date range of documents.
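For example, assuming a "timestamp" field on each log document, a nightly purge plus 
an optional date filter would look something like:

curl "http://localhost:8983/solr/collection1/update?commit=true" \
  -H 'Content-type:text/xml' \
  -d '<delete><query>timestamp:[* TO NOW/DAY-30DAYS]</query></delete>'

/select?q=some+keyword&fq=timestamp:[NOW/DAY-7DAYS+TO+NOW]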

this is the same general approach you would use on a non-distributed / 
non-SolrCloud setup if you just had a single collection on a single master 
replicated to some number of slaves for horizontal scaling.


-Hoss


Re: Solr/Lucene Analayzer That Writes To File

2013-05-28 Thread Chris Hostetter

: I want to use Solr for an academical research. One step of my purpose is I
: want to store tokens in a file (I will store it at a database later) and I

you could absolutely write a java program which accesses the analyzers 
directly and does whatever you want with the results of analysing a piece 
of text that you feed in.   
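A bare-bones sketch of that (the analyzer choice, field name and output file are 
placeholders):

import java.io.FileWriter;
import java.io.PrintWriter;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokensToFile {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
    PrintWriter out = new PrintWriter(new FileWriter("tokens.txt"));
    // analyze one piece of text and write one token per line
    TokenStream ts = analyzer.tokenStream("body",
        new StringReader("The quick brown fox jumped over the lazy dog"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      out.println(term.toString());
    }
    ts.end();
    ts.close();
    out.close();
    analyzer.close();
  }
}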

Alternatively, you could use something like the 
FieldAnalysisRequestHandler in solr, so that you could have an arbitrary 
client send data to solr asking it to analyze it for you and break it down 
into tokens, per your schema.xml...

http://localhost:8983/solr/collection1/analysis/field?analysis.fieldvalue=The%20quick%20brown%20fox%20jumped%20over%20the%20lazy%20dog&analysis.fieldtype=text_en&wt=json&indent=true

(this is exactly how the Analysis page in the admin UI works, the 
javascript powering that page hits this same URL)

https://lucene.apache.org/solr/4_3_0/solr-core/org/apache/solr/handler/FieldAnalysisRequestHandler.html


-Hoss


Re: SOLR 4.3.0 - How to make fq optional?

2013-05-28 Thread Chris Hostetter

: As erik alluded to in his response, you should be able to configure an 
: "appended" fq using the "switch" QParserPlugin to get something like what you 
are 
: describing, by taking advantage of the "default" behavior.

I've updated the javadocs with 2 additiona examples inspired by this 
thread..

http://svn.apache.org/viewvc?view=revision&revision=1487166


-Hoss


Re: Restaurant availability from database

2013-05-28 Thread Chris Hostetter

: I've created a custom ValueSourceParser and ValueSource that retrieve the
: availability information from a MySQL database. An example query is as
: follows.
: 
: 
http://localhost:8983/solr/collection1/select?q=restaurant_id:*&fl=*,available:availability(2013-05-23,
: 2, 1700, 2359)
: 
: This results in a psuedo (boolean) field "available" per document result and
: this works as expected. But my problem is that I also need the total number
: of available restaurants.

1) "restaurant_id:*" is not doing what you think it is doing, use "*:*" or 
add an "is_restaurant" boolean field and query on that instead and you 
will probably discover that your queries for all docs (or all 
restaurants) get much much faster.

2) if you've already built a custom ValueSourceParser that you're really 
happy with, and you just want to filter your Solr results based on the 
output of that custom ValueSource, you can do so by leveraging the frange 
QParser.  If your custom value source returns a boolean, then you just 
have to be a bit tricky with the function range you ask for...

 fq={!frange cache=false cost=1000 
l=1}if(availability(2013-05-23,2,1700,2359),5,0)

A few things to note in this example:

a) i'm using the if() function to map true to "5" (arbitrary) and false to 
"0" (also arbitrary) and then filtering to only match documents whose 
value is "1" (arbitrary) or higher ... you can pick any values you want

b) unlike using your custom value source in the "fl" when used in an fq, 
your ValueSource function will be called for a *lot* of documents -- so you 
probably want to batch request the availability when the ValueSourceParser 
is called, for fast lookup on each individual document.

c) i've specified cache=false and a high cost param on the frange to 
ensure that the custom ValueSource is only ever asked about the 
availability of documents that already match our main query and any other 
filter queries.

3) if you don't want to filter or otherwise modify the result set by the 
results of your custom ValueSource, and you just need the count of available 
documents matching your main query (independent of the numFound count of 
docs matching your main query), you can use the same technique in a 
"facet.query".

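i.e. something along the lines of:

facet=true&facet.query={!frange cache=false cost=1000 l=1}if(availability(2013-05-23,2,1700,2359),5,0)

and the count that comes back for that facet.query is the number of documents 
matching your main query that are also available.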

-Hoss


Re: Keeping a rolling window of indexes around solr

2013-05-28 Thread Saikat Kanjilal
Volume of data:
1 log insert every 30 seconds; queries are done sporadically and asynchronously, 
at a much lower frequency (every few days)

Also the majority of the requests are indeed going to be within a splice of 
time (typically hours or at most a few days)

Type of queries:
Keyword or term search
Search by guid (or id as known in the solr world)
Reserved or percolation queries to be executed when new data becomes available 
Search by dates as mentioned above

Regards


Sent from my iPhone

On May 28, 2013, at 4:25 PM, Chris Hostetter  wrote:

> 
> : This is kind of the approach used by elastic search , if I'm not using 
> : solrcloud will I be able to use shard aliasing, also with this approach 
> : how would replication work, is it even needed?
> 
> you haven't said much about hte volume of data you expect to deal with, 
> nor have you really explained what types of queries you intend to do -- 
> ie: you said you were intersted in a "rolling window of indexes
> around n days of data" but you never clarified why you think a 
> rolling window of indexes would be useful to you or how exactly you would 
> use it.
> 
> The primary advantage of sharding by date is if you know that a large 
> percentage of your queries are only going to be within a small range of 
> time, and therefore you can optimize those requests to only hit the shards 
> neccessary to satisfy that small windo of time.
> 
> if the majority of requests are going to be across your entire "n days" of 
> data, then date based sharding doesn't really help you -- you can just use 
> arbitrary (randomized) sharding using periodic deleteByQuery commands to 
> purge anything older then N days.  Query the whole collection by default, 
> and add a filter query if/when you want to restrict your search to only a 
> narrow date range of documents.
> 
> this is the same general approach you would use on a non-distributed / 
> non-SolrCloud setup if you just had a single collection on a single master 
> replicated to some number of slaves for horizontal scaling.
> 
> 
> -Hoss
> 
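
A minimal SolrJ sketch of the purge-and-filter approach described above -- the 
Solr URL, the "timestamp" field name, and the 30-day / 2-day windows are all 
assumptions for illustration, not details from this thread:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RollingWindowSketch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // periodic purge: drop everything older than the window (here 30 days)
        solr.deleteByQuery("timestamp:[* TO NOW/DAY-30DAYS]");
        solr.commit();

        // query the whole collection by default...
        SolrQuery q = new SolrQuery("some keywords");
        // ...and add a filter query only when the request should be restricted
        // to a narrow slice of time (here the last 2 days)
        q.addFilterQuery("timestamp:[NOW/DAY-2DAYS TO NOW]");
        QueryResponse rsp = solr.query(q);
        System.out.println("matches in window: " + rsp.getResults().getNumFound());
    }
}

The deleteByQuery only needs to run against the master (or against the collection, 
under SolrCloud); slaves/replicas pick the deletes up through normal replication.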


RE: OPENNLP current patch compiling problem for 4.x branch

2013-05-28 Thread Patrick Mi
Thanks Steve, that worked for branch_4x 

-Original Message-
From: Steve Rowe [mailto:sar...@gmail.com] 
Sent: Friday, 24 May 2013 3:19 a.m.
To: solr-user@lucene.apache.org
Subject: Re: OPENNLP current patch compiling problem for 4.x branch

Hi Patrick,

I think you should check out and apply the patch to branch_4x, rather than
the lucene_solr_4_3_0 tag:

http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x

Steve

On May 23, 2013, at 2:08 AM, Patrick Mi 
wrote:

> Hi,
> 
> I checked out from here
> http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_3_0 and
> downloaded the latest patch LUCENE-2899-current.patch.
> 
> Applied the patch ok but when I did 'ant compile' I got the following
error:
> 
> 
> ==
>[javac]
>
/home/lucene_solr_4_3_0/lucene/analysis/opennlp/src/java/org/apache/lucene/a
> nalysis/opennlp/FilterPayloadsFilter.java:43: error
> r: cannot find symbol
>[javac] super(Version.LUCENE_44, input);
>[javac]  ^
>[javac]   symbol:   variable LUCENE_44
>[javac]   location: class Version
>[javac] 1 error
> ==
> 
> Compiled it on trunk without problem.
> 
> Is this patch supposed to work for 4.X?
> 
> Regards,
> Patrick 
> 



Not so concurrent concurrency

2013-05-28 Thread Benson Margulies
 I can't quite apply SolrMeter to my problem, so I did something of my
own. The brains of the operation are the function here.

This feeds a ConcurrentUpdateSolrServer about 95 documents, each about
10mb, and 'threads' is six. Yet Solr just barely uses more than one
core.

    private long doIteration(File[] filesToRead) throws IOException, SolrServerException {
        ConcurrentUpdateSolrServer concurrentServer = new ConcurrentUpdateSolrServer(
                launcher.getSolrServer().getBaseURL(), 1000, threads);
        UpdateRequest updateRequest = new UpdateRequest(updateUrl);
        updateRequest.setCommitWithin(1);
        Stopwatch stopwatch = new Stopwatch();

        List<File> allFiles = Arrays.asList(filesToRead);
        Iterator<File> fileIterator = allFiles.iterator();
        while (fileIterator.hasNext()) {
            List<File> thisBatch = Lists.newArrayList();
            int batchByteCount = 0;
            while (batchByteCount < BATCH_LIMIT && fileIterator.hasNext()) {
                File thisFile = fileIterator.next();
                thisBatch.add(thisFile);
                batchByteCount += thisFile.length();
            }
            LOG.info(String.format("update %s files", thisBatch.size()));
            updateRequest.setDocIterator(new StreamingDocumentIterator(thisBatch));
            stopwatch.start();
            concurrentServer.request(updateRequest);
            concurrentServer.blockUntilFinished();
            stopwatch.stop();
        }


Re: Solr/Lucene Analayzer That Writes To File

2013-05-28 Thread Roman Chyla
You can store them and then use different analyzer chains on it (stored,
doesn't need to be indexed)

I'd probably use the collector pattern


se.search(new MatchAllDocsQuery(), new Collector() {
  private AtomicReader reader;
  private int i = 0;

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  @Override
  public void collect(int i) {
    Document d;
    try {
      d = reader.document(i, fieldsToLoad);
      for (String f : fieldsToLoad) {
        String[] vals = d.getValues(f);
        for (String s : vals) {
          // re-analyze the stored value; the first argument is the field name
          TokenStream ts = analyzer.tokenStream(f, new StringReader(s));
          ts.reset();
          while (ts.incrementToken()) {
            // do something with the analyzed tokens
          }
        }
      }
    } catch (IOException e) {
      // pass
    }
  }

  @Override
  public void setNextReader(AtomicReaderContext context) {
    this.reader = context.reader();
  }

  @Override
  public void setScorer(org.apache.lucene.search.Scorer scorer) {
    // Do Nothing
  }
});

// or persist the data here if one of your components knows how to
// write to disk, but there is no api...
TokenStream ts = analyzer.tokenStream(data.targetField, new StringReader("xxx"));
ts.reset();



On Mon, May 27, 2013 at 9:37 AM, Furkan KAMACI wrote:

> Hi;
>
> I want to use Solr for an academical research. One step of my purpose is I
> want to store tokens in a file (I will store it at a database later) and I
> don't want to index them. For such kind of purposes should I use core
> Lucene or Solr? Is there an example for writing a custom analyzer and just
> storing tokens in a file?
>


How apache solr stores indexes

2013-05-28 Thread Kamal Palei
Dear All
I have a basic doubt about how data is stored in Apache Solr indexes.

Say I have a thousand registered users on my site. Let's say I want to store
the skills of each user as a multivalued string index.

Say
user 1 has skill set - Java, MySql, PHP
user 2 has skill set - C++, MySql, PHP
user 3 has skill set - Java, Android, iOS
... so on

You can see users 1 and 2 have two common skills, MySql and PHP.
In an actual case there might be millions of repetitions of words.

Now the question is: does Apache Solr store them as just words, OR convert
each word to a unique number and store only the number?

Best Regards
Kamal
Net Cloud Systems
Bangalore, India


Re: How apache solr stores indexes

2013-05-28 Thread Alexandre Rafalovitch
And you need to know this why?

If you are really trying to understand how this all works under the
covers, you need to look at Lucene's inverted index as a start. Start
here: 
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#package_description

Might take you a couple of weeks to put it all together.

Or you could try asking the actual business-level question that you
need an answer to. :-)

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Tue, May 28, 2013 at 10:13 PM, Kamal Palei  wrote:
> Dear All
> I have a basic doubt how the data is stored in apache solr indexes.
>
> Say I have thousand registered users in my site. Lets say I want to store
> skills of each users as a multivalued string index.
>
> Say
> user 1 has skill set - Java, MySql, PHP
> user 2 has skill set - C++, MySql, PHP
> user 3 has skill set - Java, Android, iOS
> ... so on
>
> You can see user 1 and 2 has two common skills that is MySql and PHP
> In an actual case there might be millions of repetition of words.
>
> Now question is, does apache solr stores them as just words, OR converts
> each words to an unique number and stores the number only.
>
> Best Regards
> Kamal
> Net Cloud Systems
> Bangalore, India


Re: How apache solr stores indexes

2013-05-28 Thread Shashi Kant
Better still start here: http://en.wikipedia.org/wiki/Inverted_index

http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html

And there are several books on search engines and related algorithms.



On Tue, May 28, 2013 at 10:41 PM, Alexandre Rafalovitch
wrote:

> And you need to know this why?
>
> If you are really trying to understand how this all works under the
> covers, you need to look at Lucene's inverted index as a start. Start
> here:
> http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#package_description
>
> Might take you a couple of weeks to put it all together.
>
> Or you could try asking the actual business-level question that you
> need an answer to. :-)
>
> Regards,
>Alex.
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
>
>
> On Tue, May 28, 2013 at 10:13 PM, Kamal Palei 
> wrote:
> > Dear All
> > I have a basic doubt how the data is stored in apache solr indexes.
> >
> > Say I have thousand registered users in my site. Lets say I want to store
> > skills of each users as a multivalued string index.
> >
> > Say
> > user 1 has skill set - Java, MySql, PHP
> > user 2 has skill set - C++, MySql, PHP
> > user 3 has skill set - Java, Android, iOS
> > ... so on
> >
> > You can see user 1 and 2 has two common skills that is MySql and PHP
> > In an actual case there might be millions of repetition of words.
> >
> > Now question is, does apache solr stores them as just words, OR converts
> > each words to an unique number and stores the number only.
> >
> > Best Regards
> > Kamal
> > Net Cloud Systems
> > Bangalore, India
>


Re: How apache solr stores indexes

2013-05-28 Thread Kamal Palei
Thanks Alex.

I am in a dilemma about whether to store the skill sets in the solr index as
string tokens or as integers. To give a little background -

As of today, each skill is assigned a unique id (an auto increment field
in a mysql table), which is then stored against the user id in a separate table.
That's how I search for users having a particular skill, or retrieve the
complete skill set of a particular user.

Now I want to dump everything to solr and will minimize mysql usage as much
as possible. This will help me to scale to higher load.

I am just weighing the options between
1. Should I store each skill as a string token (in a new multivalued string
index)
2. OR should I store each skill as an integer (in a new multivalued integer
index)

Kindly suggest which is the better option.

Best Regards
kamal






On Wed, May 29, 2013 at 8:11 AM, Alexandre Rafalovitch
wrote:

> And you need to know this why?
>
> If you are really trying to understand how this all works under the
> covers, you need to look at Lucene's inverted index as a start. Start
> here:
> http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#package_description
>
> Might take you a couple of weeks to put it all together.
>
> Or you could try asking the actual business-level question that you
> need an answer to. :-)
>
> Regards,
>Alex.
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
>
>
> On Tue, May 28, 2013 at 10:13 PM, Kamal Palei 
> wrote:
> > Dear All
> > I have a basic doubt how the data is stored in apache solr indexes.
> >
> > Say I have thousand registered users in my site. Lets say I want to store
> > skills of each users as a multivalued string index.
> >
> > Say
> > user 1 has skill set - Java, MySql, PHP
> > user 2 has skill set - C++, MySql, PHP
> > user 3 has skill set - Java, Android, iOS
> > ... so on
> >
> > You can see user 1 and 2 has two common skills that is MySql and PHP
> > In an actual case there might be millions of repetition of words.
> >
> > Now question is, does apache solr stores them as just words, OR converts
> > each words to an unique number and stores the number only.
> >
> > Best Regards
> > Kamal
> > Net Cloud Systems
> > Bangalore, India
>


Re: How apache solr stores indexes

2013-05-28 Thread Alexandre Rafalovitch
Store them as a string token in multivalued fields. Solr/Lucene will
do the necessary mapping and lookups. That's what you are paying it
for. :-) That way you can easily facet and so on.
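
For example, a minimal sketch of what that could look like -- the field name
"skills" is an assumption for illustration, and the type refers to the stock
"string" (solr.StrField) fieldType from the example schema:

 <field name="skills" type="string" indexed="true" stored="true" multiValued="true"/>

With the skills indexed that way, a plain facet request returns the per-skill
counts directly:

 /select?q=*:*&rows=0&facet=true&facet.field=skills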

You may need to change some parts of your architecture later, but you
seem to be over-thinking it too early in the process.

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Tue, May 28, 2013 at 10:54 PM, Kamal Palei  wrote:
> Thanks Alex.
>
> I am in dilemma how do I store the skill sets with solr index as a string
> token or as an integer. To give little background -
>
> As of today, each skill I assign a unique id (take as auto increment field
> in mysql table), and the store them against user id in a separate table.
> That's how I do search for users having  a particular skill or retrieve
> complete skill set of a particular user.
>
> Now I want to dump everything to solr and will minimize mysql usage as low
> as possible. This will help me to scale to higher load.
>
> I am just weighing down options between
> 1. Should I store each skill as a string token (in a new multivalued string
> index)
> 2. OR should I store each skill as an integer (in a new multivalued integer
> index)
>
> Kindly suggest which is better option.
>
> Best Regards
> kamal
>
>
>
>
>
>
> On Wed, May 29, 2013 at 8:11 AM, Alexandre Rafalovitch
> wrote:
>
>> And you need to know this why?
>>
>> If you are really trying to understand how this all works under the
>> covers, you need to look at Lucene's inverted index as a start. Start
>> here:
>> http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#package_description
>>
>> Might take you a couple of weeks to put it all together.
>>
>> Or you could try asking the actual business-level question that you
>> need an answer to. :-)
>>
>> Regards,
>>Alex.
>> Personal blog: http://blog.outerthoughts.com/
>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> - Time is the quality of nature that keeps events from happening all
>> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
>> book)
>>
>>
>> On Tue, May 28, 2013 at 10:13 PM, Kamal Palei 
>> wrote:
>> > Dear All
>> > I have a basic doubt how the data is stored in apache solr indexes.
>> >
>> > Say I have thousand registered users in my site. Lets say I want to store
>> > skills of each users as a multivalued string index.
>> >
>> > Say
>> > user 1 has skill set - Java, MySql, PHP
>> > user 2 has skill set - C++, MySql, PHP
>> > user 3 has skill set - Java, Android, iOS
>> > ... so on
>> >
>> > You can see user 1 and 2 has two common skills that is MySql and PHP
>> > In an actual case there might be millions of repetition of words.
>> >
>> > Now question is, does apache solr stores them as just words, OR converts
>> > each words to an unique number and stores the number only.
>> >
>> > Best Regards
>> > Kamal
>> > Net Cloud Systems
>> > Bangalore, India
>>


Re: How apache solr stores indexes

2013-05-28 Thread Jack Krupansky
As a general rule with Solr, do a proof of concept implementation with the 
simplest sensible approach and only start piling on complexity if 
performance or capacity become problematic. If the data is naturally a 
string, use a string. If it is naturally a number, use a number. Use 
whatever the query clients will be most comfortable with.


-- Jack Krupansky

-Original Message- 
From: Kamal Palei

Sent: Tuesday, May 28, 2013 10:54 PM
To: solr-user@lucene.apache.org
Subject: Re: How apache solr stores indexes

Thanks Alex.

I am in dilemma how do I store the skill sets with solr index as a string
token or as an integer. To give little background -

As of today, each skill I assign a unique id (take as auto increment field
in mysql table), and the store them against user id in a separate table.
That's how I do search for users having  a particular skill or retrieve
complete skill set of a particular user.

Now I want to dump everything to solr and will minimize mysql usage as low
as possible. This will help me to scale to higher load.

I am just weighing down options between
1. Should I store each skill as a string token (in a new multivalued string
index)
2. OR should I store each skill as an integer (in a new multivalued integer
index)

Kindly suggest which is better option.

Best Regards
kamal






On Wed, May 29, 2013 at 8:11 AM, Alexandre Rafalovitch
wrote:


And you need to know this why?

If you are really trying to understand how this all works under the
covers, you need to look at Lucene's inverted index as a start. Start
here:
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#package_description

Might take you a couple of weeks to put it all together.

Or you could try asking the actual business-level question that you
need an answer to. :-)

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Tue, May 28, 2013 at 10:13 PM, Kamal Palei 
wrote:
> Dear All
> I have a basic doubt how the data is stored in apache solr indexes.
>
> Say I have thousand registered users in my site. Lets say I want to 
> store

> skills of each users as a multivalued string index.
>
> Say
> user 1 has skill set - Java, MySql, PHP
> user 2 has skill set - C++, MySql, PHP
> user 3 has skill set - Java, Android, iOS
> ... so on
>
> You can see user 1 and 2 has two common skills that is MySql and PHP
> In an actual case there might be millions of repetition of words.
>
> Now question is, does apache solr stores them as just words, OR converts
> each words to an unique number and stores the number only.
>
> Best Regards
> Kamal
> Net Cloud Systems
> Bangalore, India





Re: How apache solr stores indexes

2013-05-28 Thread Kamal Palei
Thanks a lot for all your input.
I will go ahead and store as strings.

Best Regards
Kamal


On Wed, May 29, 2013 at 9:00 AM, Jack Krupansky wrote:

> As a general rule with Solr, do a proof of concept implementation with the
> simplest sensible approach and only start piling on complexity if
> performance or capacity become problematic. If the data is naturally a
> string, use a string. If it is naturally a number, use a number. Use
> whatever the query client's will be most comfortable with.
>
> -- Jack Krupansky
>
> -Original Message- From: Kamal Palei
> Sent: Tuesday, May 28, 2013 10:54 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How apache solr stores indexes
>
>
> Thanks Alex.
>
> I am in dilemma how do I store the skill sets with solr index as a string
> token or as an integer. To give little background -
>
> As of today, each skill I assign a unique id (take as auto increment field
> in mysql table), and the store them against user id in a separate table.
> That's how I do search for users having  a particular skill or retrieve
> complete skill set of a particular user.
>
> Now I want to dump everything to solr and will minimize mysql usage as low
> as possible. This will help me to scale to higher load.
>
> I am just weighing down options between
> 1. Should I store each skill as a string token (in a new multivalued string
> index)
> 2. OR should I store each skill as an integer (in a new multivalued integer
> index)
>
> Kindly suggest which is better option.
>
> Best Regards
> kamal
>
>
>
>
>
>
> On Wed, May 29, 2013 at 8:11 AM, Alexandre Rafalovitch
> wrote:
>
>  And you need to know this why?
>>
>> If you are really trying to understand how this all works under the
>> covers, you need to look at Lucene's inverted index as a start. Start
>> here:
>> http://lucene.apache.org/core/**4_3_0/core/org/apache/lucene/**
>> codecs/lucene42/package-**summary.html#package_**description
>>
>> Might take you a couple of weeks to put it all together.
>>
>> Or you could try asking the actual business-level question that you
>> need an answer to. :-)
>>
>> Regards,
>>Alex.
>> Personal blog: http://blog.outerthoughts.com/
>> LinkedIn: 
>> http://www.linkedin.com/in/**alexandrerafalovitch
>> - Time is the quality of nature that keeps events from happening all
>> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
>> book)
>>
>>
>> On Tue, May 28, 2013 at 10:13 PM, Kamal Palei 
>> wrote:
>> > Dear All
>> > I have a basic doubt how the data is stored in apache solr indexes.
>> >
>> > Say I have thousand registered users in my site. Lets say I want to >
>> store
>> > skills of each users as a multivalued string index.
>> >
>> > Say
>> > user 1 has skill set - Java, MySql, PHP
>> > user 2 has skill set - C++, MySql, PHP
>> > user 3 has skill set - Java, Android, iOS
>> > ... so on
>> >
>> > You can see user 1 and 2 has two common skills that is MySql and PHP
>> > In an actual case there might be millions of repetition of words.
>> >
>> > Now question is, does apache solr stores them as just words, OR converts
>> > each words to an unique number and stores the number only.
>> >
>> > Best Regards
>> > Kamal
>> > Net Cloud Systems
>> > Bangalore, India
>>
>>
>


Choosing specific fields for suggestions in SpellCheckerComponent

2013-05-28 Thread Wilson Passos

Hi everyone,


I've been searching for how to configure the SpellCheckerComponent in 
Solr 4.0 to support suggestion queries based on a subset of the 
configured fields in schema.xml. Let's say the spell checking is 
configured to use these 4 fields:







I'd like to know if there's any possibility to dynamically set the 
SpellCheckerComponent to suggest terms using just fields "field2" and 
"field3" instead of the default behavior, which always includes 
suggestions across the 4 defined fields.


Thanks in advance for any help!


Re: [blog post] Automatically Acquiring Synonym Knowledge from Wikipedia

2013-05-28 Thread Rajesh Nikam
Hi Koji,

Great news! I am looking forward to this OpenNLP toolkit.

Thanks a lot !
Rajesh



On Wed, May 29, 2013 at 4:12 AM, Koji Sekiguchi  wrote:

> Hi Rajesh,
>
> Thanks!
> I'm planning to open an NLP tool kit for Lucene, and the tool kit will
> include
> the following synonym library.
>
> koji
>
>
> (13/05/28 14:12), Rajesh Nikam wrote:
>
>> Hello Koji,
>>
>> This is seems pretty useful post on how to create synonyms file.
>> Thanks a lot for sharing this !
>>
>> Have you shared source code / jar for the same so at it could be used ?
>>
>> Thanks,
>> Rajesh
>>
>>
>>
>> On Mon, May 27, 2013 at 8:44 PM, Koji Sekiguchi 
>> wrote:
>>
>>  Hello,
>>>
>>> Sorry for cross post. I just wanted to announce that I've written a blog
>>> post on
>>> how to create synonyms.txt file automatically from Wikipedia:
>>>
>>>
>>> http://soleami.com/blog/**automatically-acquiring-**
>>> synonym-knowledge-from-**wikipedia.html
>>>
>>> Hope that the article gives someone a good experience!
>>>
>>> koji
>>> --
>>>
>>> http://soleami.com/blog/**lucene-4-is-super-convenient-**
>>> for-developing-nlp-tools.html
>>>
>>>
>>
>
> --
> http://soleami.com/blog/**automatically-acquiring-**
> synonym-knowledge-from-**wikipedia.html
>


OPENNLP problems

2013-05-28 Thread Patrick Mi
Hi there,

Checked out branch_4x and applied the latest patch
LUCENE-2899-current.patch; however, I ran into 2 problems.

Followed the wiki page instructions and set up a field with this type, aiming
to keep nouns and verbs and do a facet on the field:
==

  




  

==

Struggled to get that going until I put the extra parameter
keepPayloads="true" in as below. 
 

Question: am I doing the right thing? Is this a mistake on the wiki? 

Second problem:

Posted the document xml one by one to Solr and the result was what I
expected.



  1
  check in the hotel


However, if I put multiple documents into the same xml file and post it in
one go, only the first document gets processed (only 'check' and 'hotel' were
showing in the facet result.) 
 


  1
  check in the hotel


  2
  removes the payloads


  3
  retains only nouns and verbs 



Same problem when I updated the data using csv upload.

Is that a bug or something I did wrong?

Thanks in advance!

Regards,
Patrick




Sorting results by last update date

2013-05-28 Thread Kamal Palei
Hi All
I am trying to sort the results by last updated date. My url looks as
below.

*&fq=last_updated_date:[NOW-60DAY TO NOW]&fq=experience:[0 TO
588]&fq=salary:[0 TO 500] OR
salary:0&fq=-bundle:job&fq=-bundle:panel&fq=-bundle:page&fq=-bundle:article&spellcheck=true&q=+java
+sip&fl=id,entity_id,entity_type,bundle,bundle_name,label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,tos_name,zm_parent_entity,ss_filemime,ss_file_entity_title,ss_file_entity_url,ss_field_uid&spellcheck.q=+java
+sip&qf=content^40&qf=label^5.0&qf=tos_content_extra^0.1&qf=tos_name^3.0&hl.fl=content&mm=1&q.op=AND&wt=json&
json.nl=map&sort=last_updated_date asc
*
With this I get the data in ascending order of last updated date.

If I am trying to sort data in descending order, I use below url

*&fq=last_updated_date:[NOW-60DAY TO NOW]&fq=experience:[0 TO
588]&fq=salary:[0 TO 500] OR
salary:0&fq=-bundle:job&fq=-bundle:panel&fq=-bundle:page&fq=-bundle:article&spellcheck=true&q=+java
+sip&fl=id,entity_id,entity_type,bundle,bundle_name,label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,tos_name,zm_parent_entity,ss_filemime,ss_file_entity_title,ss_file_entity_url,ss_field_uid&spellcheck.q=+java
+sip&qf=content^40&qf=label^5.0&qf=tos_content_extra^0.1&qf=tos_name^3.0&hl.fl=content&mm=1&q.op=AND&wt=json&
json.nl=map&sort=last_updated_date desc*

Here the data set is not ordered properly; it looks to me like the data is
ordered on the basis of score, not last updated date.

Can somebody tell me what I am missing here, and why *desc* is not working
properly for me?

Thanks
kamal


Re: What exactly happens to extant documents when the schema changes?

2013-05-28 Thread Dotan Cohen
On Tue, May 28, 2013 at 2:20 PM, Upayavira  wrote:
> The schema provides Solr with a description of what it will find in the
> Lucene indexes. If you, for example, changed a string field to an
> integer in your schema, that'd mess things up bigtime. I recently had to
> upgrade a date field from the 1.4.1 date field format to the newer
> TrieDateField. Given I had to do it on a live index, I had to add a new
> field (just using copyfield) and re-index over the top, as the old field
> was still in use. I guess, given my app now uses the new date field
> only, I could presumably reindex the old date field with the new
> TrieDateField format, but I'd want to try that before I do it for real.
>

Thank you for the insight. Unfortunately, with 20 million records, growing
by hundreds each minute (social media posts), I don't see that I could ever
reindex the data in a timely way.


> However, if you changed a single valued field to a multi-valued one,
> that's not an issue, as a field with a single value is still valid for a
> multi-valued field.
>
> Also, if you add a new field, existing documents will be considered to
> have no value in that field. If that is acceptable, then you're fine.
>
> I guess if you remove a field, then those fields will be ignored by
> Solr, and thus not impact anything. But I have to say, I've never tried
> that.
>
> Thus - changing the schema will only impact on future indexing. Whether
> your existing index will still be valid depends upon the changes you are
> making.
>
> Upayavira

Thanks.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com