Re: Can index size increase when no updates/optimizes are happening?

2013-03-18 Thread eanand333
This is what we do:
A user logs in, enters a few documents in a particular domain (say A, B,
or C), and logs out.
Say B is the most commonly used domain. The increase in index size is
drastic only in this particular domain.

So unless a user logs in, there's no question of documents being submitted
or any indexing activity.
From the logs I see no user logged in during that time frame.

Adding to this, we take a backup of the index every day, and that's
how we even came to know that such a problem existed.

Is there a possibility that I could schedule an optimize to run at a
specific time during the day and thereby try to control the index's file size?
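
Something like a nightly cron entry hitting the update handler is what I have
in mind (just a sketch, assuming a single core reachable at localhost:8983):

0 3 * * * curl 'http://localhost:8983/solr/update?optimize=true'

Would that be reasonable, given that an optimize needs extra disk space while
it runs?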






On Sat, Mar 16, 2013 at 6:51 PM, Erick Erickson [via Lucene] <
ml-node+s472066n4047962...@n3.nabble.com> wrote:

> Well, if nothing is going on at all, it's hard to see why the index would
> increase. So I suspect _something_ is going on. Possibilities:
>
> 1> You have indexing activity going on. Even if it's just replacing docs
> that already exist (which is actually an add plus a delete), the index will
> grow for a while; the deleted info isn't removed from the index until the
> segment is merged, which happens unpredictably (well, actually predictably,
> but not on a schedule you enforce). So the index would
> grow/shrink/grow/shrink. Do you have automatic processes in the background
> that push docs to the index?
>
> 2> You forceMerge (optimize), in which case the index will at least double
> in size temporarily.
>
> 3> You are replicating. While the replication goes on, especially if your
> index has changed greatly, then your index could double.
>
> None of these fit the symptoms you describe very well, mind you. It's
> suspicious that these increases last long enough for you to see them
> predictably in the morning, unless, say, a background process indexes
> things regularly and you always look at the same time... unlikely at best.
>
> But the fact that your index goes from 18G to 8G strongly suggests that
> you're doing a forceMerge/optimize when you see it bump up to 18G.
>
> Best
> Erick
>
>
>
>
> On Sat, Mar 16, 2013 at 4:52 AM, eanand333 <[hidden 
> email]>
> wrote:
>
> > Hi, I am kind of new here. I've got the same question...
> > I am using Java version 1.6 and Lucene version 3.3.
> > Can the index file size increase automatically overnight?
> > During the evening I see the size around 11GB; the next morning I see it
> > to be 18GB, and then the size reduces to around 8GB.
> > I have checked the logs and I am sure that there was no user activity
> > during this period.
> >
> > Mr. Erick has suggested this to be a user error; if that's true, I would
> > like to know what the possible errors are that could result in the rapid
> > increase in index file size.
> >
> > Or what are the other possibilities for the index size to increase
> > exponentially?
> >
> > Thanks
> >
> >
> >





Facets with 5000 facet fields

2013-03-18 Thread sivaprasad
Hi,

We have configured Solr with 5000 facet fields as part of a request handler.
We have 10,811,177 docs in the index.

The Solr server machine is a quad core with 12 GB of RAM.

When we query with facets, we get an out-of-memory error.

What we observed is that with a larger number of facet fields, we need a
larger RAM allocation for the JVM. In that case we have to scale up the
system whenever we add more facets.

To scale the system out instead, do we need to go with distributed search?

Any thoughts on how to handle this situation would help.

Thanks,
Siva






Re: removing all fields before full import using DIH

2013-03-18 Thread Gora Mohanty
On 18 March 2013 13:09, Rohan Thakur  wrote:
> hi all
>
> how can I ensure that I have deleted all the documents from Solr before
> doing a full import with DIH? The aim is that my database is pretty small,
> so a full import takes only 3-4 sec; thus I do not require delta import for
> now. I want to ensure that whenever I do a full import of the database, no
> duplicates get indexed, that is, multiple instances of the same document do
> not get indexed. So I want to delete all the documents first and then
> reindex using full import. Can anyone help?

Have you tried a full-import? What you want is done by default by
DIH, unless one specifies clean=false as a query parameter to the
full-import URL.
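
For example (assuming the handler is registered at the usual /dataimport
path):

http://localhost:8983/solr/dataimport?command=full-import&clean=true

clean=true is the default, so a plain full-import already deletes the
existing documents before indexing.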

Regards,
Gora


Re: Facets with 5000 facet fields

2013-03-18 Thread Upayavira
I'd be very surprised if this were to work. I recall one situation in
which 24 facet fields in a request placed too much pressure on the server.

In order to support faceting, Solr maintains a cache of each faceted
field. You need one cache for every field you are faceting on, meaning
your memory requirements will be substantial, unless, I guess, your
fields are sparse. Also, during a faceting request, the server must do a
scan across each of those fields, and that will take time; with that
many fields, I'd imagine quite a bit of time.

Upayavira

On Mon, Mar 18, 2013, at 07:34 AM, sivaprasad wrote:
> Hi,
> 
> We have configured solr for 5000 facet fields as part of request
> handler.We
> have 10811177 docs in the index.
> 
> The solr server machine is quad core with 12 gb of RAM.
> 
> When we are querying with facets, we are getting out of memory error.
> 
> What we observed is , If we have larger number of facets we need to have
> larger RAM allocated for JVM. In this case we need to scale up the system
> as
> and when we add more facets.
> 
> To scale out the system, do we need to go with distributed search?
> 
> Any thoughts on this helps me to handle this situation.
> 
> Thanks,
> Siva
> 
> 
> 
> 


Re: Facets with 5000 facet fields

2013-03-18 Thread Toke Eskildsen
On Mon, 2013-03-18 at 08:34 +0100, sivaprasad wrote:
> We have configured solr for 5000 facet fields as part of request handler.We
> have 10811177 docs in the index.
> 
> The solr server machine is quad core with 12 gb of RAM.
> 
> When we are querying with facets, we are getting out of memory error.

Solr's faceting treats each field separately. This makes it flexible,
but also means that it has a speed as well as a memory penalty when the
number of fields rises.

It depends on what you are faceting on, but let's say that you are
faceting on Strings and that each field has 200 unique values. For each
field, a list with #docs entries of size log2(#unique_values) bits will
be maintained. With 11M documents and 200 unique values, this is 11M * 8
bits = 88 Mbit ~= 11 MByte. There is more overhead than this, but it is
irrelevant for this back-of-the-envelope calculation.

5000 fields @ 11 MByte is about 55 GB for faceting.

If you had a single field with 200 * 5000 unique values, the memory
penalty would be 11M * log2(200*5000) bits = 11M * 20 bits ~= 30 MB, plus
some extra overhead.

It seems that the way forward is to see if you can somehow reduce your
requirements from the heavy "facet on 5000 fields" to something more
manageable.

Do you always facet on all the fields for each call? If not, you could
create a single facet field and prefix each value with its field name:

field1/value1a
field1/value1b
field2/value2a
field2/value2b
field2/value2c

and so on. To perform faceting on field 2, make a facet prefix query for
"field2/".

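A sketch of such a request (the combined field name facet_all is just an
example; adjust to your schema):

/solr/select?q=*:*&rows=0&facet=true&facet.field=facet_all&facet.prefix=field2/

The client then strips the "field2/" prefix from the returned values.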

If you do need to facet on all 5000 fields each time, you could just
repeat the above 5000 times. It will work, take little memory and will
likely take far too long. 

If you are feeling really adventurous, take a look at 
https://issues.apache.org/jira/browse/SOLR-2412
it creates a single structure for a multi-field request, meaning that
only a single 11M entry array will be created for the 11M documents. The
full memory overhead should be around the same as with a single field.

I haven't tested SOLR-2412 on anything near your corpus, but it is a
very interesting test case.

> What we observed is , If we have larger number of facets we need to have
> larger RAM allocated for JVM. In this case we need to scale up the system as
> and when we add more facets.
> 
> To scale out the system, do we need to go with distributed search?

That would work if you do not need to facet on all fields all the time.
If you do need to facet on all fields on each call, you will need to
scale to many machines to get proper performance and the merging
overhead will likely be huge.

Regards,
Toke Eskildsen



Re: removing all fields before full import using DIH

2013-03-18 Thread Rohan Thakur
OK, thanks. No, I hadn't checked it before; I was using DIH full import
directly, and one day I observed that my Solr search was giving duplicate
results. I then deleted all the entries and re-indexed the data, and after
that, to ensure this did not happen again, I always deleted everything first
and then did a full import... so good to know this happens automatically.

thanks for confirming.

regards
Rohan

On Mon, Mar 18, 2013 at 1:32 PM, Gora Mohanty  wrote:

>
> Have you tried a full-import? What you want is done by default by
> DIH, unless one specifies clean=false as a query parameter to the
> full-import URL.
>
> Regards,
> Gora
>


Re: Fuzzy Suggester and exactMatchFirst

2013-03-18 Thread Robert Muir
On Sun, Mar 17, 2013 at 8:19 PM, Eoghan Ó Carragáin
 wrote:
>
> I can see why the Fuzzy Suggester sees "college" as a match for "colla", but
> expected the exactMatchFirst parameter to ensure that suggestions beginning
> with "colla" would be weighted higher than "fuzzier" matches. I
> have spellcheck.onlyMorePopular set to true, in case this makes a
> difference.
>
> Am I misunderstanding what exactMatchFirst is supposed to do? Is there a
> way to ensure suggestions matching exactly what the user has entered rank
> higher than fuzzy matches?
>

I think exactMatchFirst is unrelated to typo-correction: it only
ensures that if you type the whole suggestion exactly, the weight
is completely ignored.
This means if you type 'college' and there is an actual suggestion of
'college' it will be weighted above 'colleges' even if colleges has a
much higher weight.

On the other hand, what you want (I think) is to punish the weights of
suggestions that required some corrections. Currently I don't think
there is any way to do that:

 * NOTE: This suggester does not boost suggestions that
 * required no edits over suggestions that did require
 * edits.  This is a known limitation.

I think the trickiest part about this is how the "punishment" formula
should work. Because today this thing makes no assumptions as to how
you came up with your suggestion weights...

But feel free to open a JIRA issue if you have ideas !


Incorrect snippets using FastVectorHighlighter

2013-03-18 Thread Jochen Just

Hi list,

I have the following field type defined in my schema.xml in order to be able
to do in-word search.

[The fieldType definition was stripped by the list archive.]

Searching itself works as expected, though highlighting causes me headaches.
At first I did not use the FastVectorHighlighter, which meant highlighting did
not work at all for fields of this type. Since I've been using the
FastVectorHighlighter, highlighting works most of the time, but sometimes it
doesn't.

Given a document containing the word
'Superkalifragilistischexpialligetisch', when I search for
'uperkalifragilistische' I would expect that entire substring to be
highlighted, but the actual highlight stops short: the trailing 'ische' is
missing from the highlighted part.

Sadly, I am not able to create a simple setup to reproduce this; it only
happens in our in-house live system. However, if I remove some fields from
the qf attribute of the edismax parser in solrconfig.xml, it stops behaving
like that. Some of those removed fields have the fieldType string_parts_back.

Does any one have a clue, what's going on?

Thanks in advance,
Jochen


--
Jochen Just   Fon:   (++49) 711/28 07 57-193
avono AG  Mobil: (++49) 172/73 85 387
Breite Straße 2   Mail:  jochen.j...@avono.de
70173 Stuttgart   WWW:   http://www.avono.de


Re: Solr indexing binary files

2013-03-18 Thread Luis
Hi Gora,

Yes, my urlpath points to a URL like that. I do not get why uncommenting
the catch-all dynamic field ("*") does not work for me.





Re: Handling a closed IndexWriter in SOLR 4.0

2013-03-18 Thread Mark Miller
I'll fix it - I put up a patch last night.

- Mark

On Mar 18, 2013, at 1:12 AM, mark12345  wrote:

> This looks similar to the issue I also have:
> 
> *
> http://lucene.472066.n3.nabble.com/Solr-4-1-4-2-SolrException-Error-opening-new-searcher-td4046543.html
> 
>   
> *https://issues.apache.org/jira/browse/SOLR-4605
> 
> 
> 
> 



Re: Incorrect snippets using FastVectorHighlighter

2013-03-18 Thread Koji Sekiguchi

Hi Jochen,

There is a restriction in FVH: it cannot deal with a variable gram size.
That is, it requires minGramSize == maxGramSize in your NGramFilterFactory
setting.
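
So a fixed gram size like this works with FVH (gram size 3 is just an
illustration):

  <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3"/>

whereas a range such as minGramSize="2" maxGramSize="15" runs into this
restriction.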

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html






RE: SOLR - Define fields in DIH configuration file dynamically

2013-03-18 Thread Dyer, James
There are 3 approaches I can think of:

1. You can generate a new data-config.xml for each import.  With Solr 4.0 and 
later, DIH re-parses your data-config.xml and picks up any changes 
automatically.

2. You can parameterize nearly anything in data-config.xml, add the parameters 
to your request URL like:
/solr/dataimport?command=full-import&key=value
...then use ${dataimporter.request.key} in your data-config.xml

3. You can put optional sections into sql queries, then use parameters to 
insert comment strings:
/solr/dataimport?command=full-import&openComment=/*&closeComment=*/
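
In the SQL this looks like (hypothetical query; the table and column names
are made up):

query="SELECT id, name FROM item
       ${dataimporter.request.openComment} WHERE last_modified > '2013-01-01'
       ${dataimporter.request.closeComment}"

Passing openComment=/* and closeComment=*/ comments the optional clause out;
passing empty values for both leaves it active.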


Not sure if this is the answer to your question.  If not, give more details.

James Dyer
Ingram Content Group
(615) 213-4311

-Original Message-
From: kobe.free.wo...@gmail.com [mailto:kobe.free.wo...@gmail.com] 
Sent: Monday, March 18, 2013 4:03 AM
To: solr-user@lucene.apache.org
Subject: SOLR - Define fields in DIH configuration file dynamically

Hello All,

Is there a way to manage (add/remove) fields in the data import handler
configuration file dynamically, through an API or library like SolrNet?

Thanks







Re: PDF keyword searches not accurate

2013-03-18 Thread JDJ
Does this make a difference?

[The configuration snippet referred to here was stripped by the list
archive.]
-
JDJ
  "There are two kinds of people in the world;
  those who understand binary, and
  those who don't.


Re: Mark document as hidden

2013-03-18 Thread lboutros
Thanks Jack.

I finally managed to replicate the external files with my own replication
handler.

But now there's an issue with Solr in the update log replay process:
the default processor chain is not used, which means that my processor that
manages the external files is not used...

I have created a Jira issue for this:

https://issues.apache.org/jira/browse/SOLR-4608

Ludovic.



-
Jouve
France.


Group By and Sum

2013-03-18 Thread Adam Harris
Hello All,

Pretty stuck here, and I am hoping you might be the person to help me out. I
am working with SOLR and JSONiq, which are totally new to me, and doing even
the simplest of things is just escaping me. I know SQL pretty well, yet this
simple requirement seems to escape me. I'll jump right into it.

Here is the schema of my Core:

[The field definitions were stripped by the list archive; the fields used
below are BusinessDateTime, NetSales and TransCount.]

I need to group by the month of BusinessDateTime and sum up NetSales and
TransCount for a given date range. Now, if this were SQL, I would just write:


SELECT sum(TransCount), sum(NetSales)

FROM Core

WHERE BusinessDateTime BETWEEN '2012/04/01' AND '2013/04/01'

GROUP BY MONTH(BusinessDateTime)

But of course nothing is this simple with SOLR and/or JSONiq. I have tried
messing around with facets and grouping, but they never seem to work the way
I want them to. For example, here is a query I am currently playing with:


?wt=json
&indent=true
&q=*:*
&rows=0
&facet=true
&facet.date=BusinessDateTime
&facet.date.start=2012-02-01T00:00:01Z
&facet.date.end=2013-02-01T23:59:59Z
&facet.date.gap=%2B1MONTH
&group=true
&group.field=BusinessDateTime
&group.facet=true
&group.field=NetSales

Now the date facet is working properly, however it returns the count of the
documents, whereas I need the sum of the NetSales and TransCount fields
instead.

Any help or suggestions would be greatly appreciated.

Thanks,
Adam


Re: Group By and Sum

2013-03-18 Thread Walter Underwood
You should use a relational database. Solr is not really designed for this kind 
of query.

wunder


--
Walter Underwood
wun...@wunderwood.org





Re: 4.0 hanging on startup on Windows after Control-C

2013-03-18 Thread Shawn Heisey

On 3/17/2013 11:51 AM, xavier jmlucjav wrote:

Hi,

I have an index where, if I kill solr via Control-C, it consistently hangs
next time I start it. Admin does not show cores, and searches never return.
If I delete the index contents and I restart again all is ok. I am on
windows 7, jdk1.7 and Solr4.0.
Is this a known issue? I looked in jira but found nothing.


I scanned your thread dump.  Nothing jumped out at me, but given my 
inexperience with such things, I'm not surprised by that.


Have you tried 4.1 or 4.2 yet to see if the problem persists?  4.0 is no 
longer the new hotness.


Below I will discuss the culprit that springs to mind, though I don't 
know whether it's what you are actually hitting.


One thing that can make Solr take a really long time to start up is huge 
transaction logs.  Transaction logs must be replayed when Solr starts, 
and if they are huge, it can take a really long time.


Do you have tlog directories in your cores (in the data dir, next to the 
index directory), and if you do, how much disk space do they use?  The 
example config in 4.x has updateLog turned on.


There are two common situations that can lead to huge transaction logs. 
 One is exclusively using soft commits when indexing, the other is 
running a very large import with the dataimport handler and not 
committing until the very end.


AutoCommit with openSearcher=false is a good solution to both of these 
situations.  As long as you use openSearcher=false, it will not change 
what documents are visible.  AutoCommit does a regular "hard" commit 
every X new documents or every Y milliseconds.  A hard commit flushes 
index data to disk and starts a new transaction log.  Solr will only 
keep a few transaction logs around, so frequently building new ones 
keeps their size down.  When you restart Solr, you don't need to wait 
for a long time while it replays them.
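
For example (illustrative values; tune them to your indexing rate), in the
updateHandler section of solrconfig.xml:

  <autoCommit>
    <maxDocs>25000</maxDocs>
    <maxTime>300000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>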


Thanks,
Shawn



RE: Group By and Sum

2013-03-18 Thread Adam Harris
I agree, however the powers that be (upper management) have decided that
we need to switch to SOLR, JSONiq and JavaScript MVC for all our reporting
needs. I would love to just keep using the SQL DB that we have been using,
but alas, I am not allowed to.

Thanks,
Adam

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Monday, March 18, 2013 11:58 AM
To: solr-user@lucene.apache.org
Subject: Re: Group By and Sum

You should use a relational database. Solr is not really designed for this kind 
of query.

wunder


--
Walter Underwood
wun...@wunderwood.org





Re: Group By and Sum

2013-03-18 Thread Miguel

Hi Adam,

Have you seen the wiki page about field collapsing?
http://wiki.apache.org/solr/FieldCollapsing

I think that page will help you emulate GROUP BY.








Search on final value in multi-valued field

2013-03-18 Thread Annette Newton
Are multi-valued fields ordered and if so is it possible to search on the
final value only?

-- 

Annette Newton

Database Administrator

ServiceTick Ltd



T:+44(0)1603 618326



Seebohm House, 2-4 Queen Street, Norwich, England NR2 4SQ

www.servicetick.com

*www.sessioncam.com*



Re: Group By and Sum

2013-03-18 Thread Walter Underwood
Dang. Well, make your estimates clear. I would not be surprised if this took a 
few weeks to get something that worked but was too slow. It might require new 
Solr or Lucene features to make it fast.

It may be possible to put something together out of the existing features. If 
it is, the people on this list are the ones who could make that hack work.
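
One existing feature that might get you part of the way (an untested sketch)
is the StatsComponent, which can sum numeric fields per bucket of a facet
field. It would need an indexed month field (BusinessMonth here is a made-up
name), since stats.facet takes a plain field rather than a date range:

?q=BusinessDateTime:[2012-04-01T00:00:00Z TO 2013-04-01T00:00:00Z]
&rows=0&stats=true&stats.field=NetSales&stats.field=TransCount
&stats.facet=BusinessMonth

The response then carries a sum (plus count, min, max, etc.) of NetSales and
TransCount for each month value.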

wunder

On Mar 18, 2013, at 10:13 AM, Adam Harris wrote:

> I agree however the powers that be, being upper management, have decided that 
> we need to switch to SOLR, JSONiq and JavaScript MVC for all our reporting 
> needs. I would love to just keep using the SQL DB that we have been using but 
> alas I am not allowed to.
> 
> Thanks,
> Adam
> 

--
Walter Underwood
wun...@wunderwood.org





Re: Search on final value in multi-valued field

2013-03-18 Thread Jack Krupansky
Yes, order is maintained, but search is simply whether any of the multiple 
values matches.
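
If you truly need to match only the final value, the usual workaround (there
is no built-in "last value" search that I know of) is to copy the last value
into its own single-valued field at index time and query that field instead.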


-- Jack Krupansky

-Original Message- 
From: Annette Newton

Sent: Monday, March 18, 2013 1:40 PM
To: solr-user@lucene.apache.org
Subject: Search on final value in multi-valued field

Are multi-valued fields ordered and if so is it possible to search on the
final value only?




how to deploy customization in solr that requires dependency

2013-03-18 Thread Gian Maria Ricci
Hi to everyone,

 

I want to deploy a custom filter developed in Java to Solr 4. My problem is
that it needs to access SQL Server, so it depends on sqljdbc4.jar, but I get
a java.lang.ClassNotFoundException:
com.microsoft.sqlserver.jdbc.SQLServerDriver

I've copied the sqljdbc4.jar file into the c:\tomcat\lib directory, but that
does not work. So I've modified solrconfig.xml to load sqljdbc4.jar, and now
it works, but I think a better solution is to configure it in some base
Tomcat directory instead of in the solrconfig.xml file of every core. My
question is: where can I configure Tomcat to make sqljdbc4.jar available to
every custom filter in Solr without referencing it directly in
solrconfig.xml?

 

I'm a .NET developer, so I'm not really familiar with Tomcat and Java, and
surely I'm missing some configuration.

 

Thanks to everyone.

 

Gian Maria

 



Re: structure of solr index

2013-03-18 Thread alxsss

 
--- So, "search" time is in no way impacted by the existence or non-existence
of stored values.

What about memory? Would it require more memory to keep the same QTime as in
the indexed-only case? For example, with indexed-only fields the index size
is 5GB, the average QTime is 0.1 sec and memory is 10GB. When the same fields
are indexed and stored, the index size is 50GB. Will the QTime be 0.1s plus
the time for extracting the stored fields?

Another scenario is to store the fields in HBase or Cassandra, keep only
indexed fields in Solr, and after getting the id field from Solr, extract the
stored values from HBase or Cassandra. Will this setup be faster than the one
with stored fields in Solr?

Thanks.
Alex.

 

-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Sat, Mar 16, 2013 9:53 am
Subject: Re: structure of solr index


"Search" depends only on the "index". But... returning field values for each 
of the matched documents does require access to the "stored" values. So,
"search" time is in no way impacted by the existence or non-existence of
stored values, but total query processing time would of course include both
search time and the time to access and format the stored field values.

-- Jack Krupansky

-Original Message- 
From: alx...@aim.com
Sent: Saturday, March 16, 2013 12:48 PM
To: solr-user@lucene.apache.org
Subject: Re: structure of solr index

Hi,

So, will search time be the same for the case when fields are indexed only 
vs  the case when they are indexed and stored?



Thanks.
Alex.



-Original Message-
From: Otis Gospodnetic 
To: solr-user 
Sent: Fri, Mar 15, 2013 8:09 pm
Subject: Re: structure of solr index


Hi,

I think you are asking if the original/raw content of those fields will be
read.  No, it won't, not for the search itself.  If you want to
retrieve/return those fields then, of course, they will be read for the
documents being returned.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Fri, Mar 15, 2013 at 2:41 PM,  wrote:

> Hi,
>
> I wondered if Solr searches on indexed fields only or on the entire index?
> In more detail, let's say I have fields id, title and content, all indexed
> and stored. Will a search load all these fields into memory, or only the
> indexed part of these fields?
>
> Thanks.
> Alex.
>
>
>



 


Re: Replica is unable to recover because leader doesn't think it is the leader (Solr 4.1)

2013-03-18 Thread Mark Miller
Hmm…

Sounds like it's a defensive mechanism we have where a leader will check its
own notion of whether it's the leader against the zk info. In this case its
own state is not convinced of its leadership. That's just a volatile boolean
that gets flipped on when elected.

What do the election nodes in ZooKeeper say? Who do they think the leader is?
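(You can list them with the ZooKeeper CLI, with the path adjusted to your
collection and shard, e.g.:
zkCli.sh -server zkhost:2181 ls /collections/collection1/leader_elect/shard1/election)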

Something is off, but I'm kind of surprised restarting the leader doesn't fix
it. Someone else should register as the leader, or the restarted node should
reclaim its spot.

I have no idea if this is solved in 4.2 or not since I don't really know what's 
happened, but I'd love to get to the bottom of it.

After setting the leader volatile boolean to true, the only way it goes false 
other than restart is session expiration. In that case we do flip to false - 
but session expiration should also cause the leader node to drop…


- Mark

On Mar 18, 2013, at 1:57 PM, Timothy Potter  wrote:

> Having an issue running on a nightly build of Solr 4.1 (tag -
> 4.1.0.2013.01.10.20.44.27)
> 
> I had a replica fail and when trying to bring it back online, recovery
> fails because the leader responds with "We are not the leader" (see trace
> below).
> 
> SEVERE: org.apache.solr.common.SolrException: We are not the leader
> 
>at
> org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:907)
> 
>at
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
> 
>at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> 
>at
> org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:365)
> 
>at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:174)
> 
>at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
> 
> ...
> 
> The worrisome part is the clusterstate.json seems to show this node (ADDR1)
> is the leader (I obfuscated addresses using ADDR1 and 2):
> 
>  "shard5":{
> 
>"range":"b8e3-c71b",
> 
>"replicas":{
> 
>  "ADDR1:8983_solr_solr_signal":{
> 
>"shard":"shard5",
> 
>"roles":null,
> 
>"state":"active",
> 
>"core":"solr_signal",
> 
>"collection":"solr_signal",
> 
>"node_name":"ADDR1:8983_solr",
> 
>"base_url":"http://ADDR1:8983/solr",
>
>  "leader":"true"},
>
>  "ADDR2:8983_solr_solr_signal":{
>
>"shard":"shard5",
>
>"roles":null,
>
>"state":"recovering",
> 
>"core":"solr_signal",
> 
>"collection":"solr_signal",
> 
>"node_name":"ADDR2:8983_solr",
> 
>"base_url":"http://ADDR2:8983/solr"}}},
> 
> 
> I assume the obvious answer is to upgrade to 4.2. I'm willing to go down
> that path but wanted to see if there was something quick I could do to get
> the leader to start thinking it is the leader again. Restarting it doesn't
> seem to do the trick.



Re: how to deploy customization in solr that requires dependency

2013-03-18 Thread Dmitry Kan
Hi,

See here, might help:

http://wiki.apache.org/solr/SolrPlugins#How_to_Load_Plugins

We don't use multicore functionality of SOLR, so we decided to bundle SOLR
dependencies into the war file of the solr web app.
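
Another option (we have not tried it ourselves) is the sharedLib attribute in
solr.xml, e.g. <solr persistent="true" sharedLib="lib">, which makes every
jar in that directory under solr home visible to all cores without touching
solrconfig.xml.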

Regards,

Dmitry

On Mon, Mar 18, 2013 at 7:47 PM, Gian Maria Ricci
wrote:

> Hi to everyone,
>
>
>
> I want to deploy a custom filter developed in java to Solr4, my problem is
> that it requires to access Sql Server, so it depends from sqljdbc4.jar, but
> I got a java.lang.ClassNotFoundException:
> com.microsoft.sqlserver.jdbc.SQLServerDriver
>
>
>
> I've copied the sqljdbc.jar file in the c:\tomcat\lib directory but it does
> not work. So I've modified solrconfig.xml to load sqljdbc4.jar and now it
> works, but I think that is a bettere solution to configure in some base
> Tomcat directory instead in solrcfg.xml file of all cores. My question is,
> where I can configure Tomcat to make sqljdbc4.jar available to every
> customization filter in solar without referencing it directly in
> solrcfg.xml?
>
>
>
> I'm a .NET developer, so I'm not really familiar with TomCat and Java, and
> surely I'm missing some configuration to modify.
>
>
>
> Thanks to everyone.
>
>
>
> Gian Maria
>
>
>
>


Surge 2013 CFP open

2013-03-18 Thread Katherine Jeschke
The Surge 2013 CFP is open. For details or to submit a paper, please visit
http://surge.omniti.com/2013

-- 
Katherine Jeschke
Director of Marketing and Creative Services
OmniTI Computer Consulting, Inc.
11830 West Market Place, Suite F
Fulton, MD 20759
O: 240-646-0770, 222
F: 301-497-2001
C: 443/643-6140
omniti.com
Surge 2013 



Re: Replica is unable to recover because leader doesn't think it is the leader (Solr 4.1)

2013-03-18 Thread Timothy Potter
Hi Mark,

Thanks for responding.

Looking under /collections/solr_signal/leader_elect/shard5/election/ there
are 2 nodes:

161276082334072879-ADDR1:8983_solr_solr_signal-n_53 - Mon Mar 18
17:36:41 UTC 2013
161276082334072880-ADDR2:8983_solr_solr_signal-n_56 - Mon Mar 18
17:48:22 UTC 2013

So it looks like the election entry for ADDR2 (the node that cannot recover)
is later than the one for ADDR1 (the node still online and serving requests).

Could I just delete that newer node from ZK?

Cheers,
Tim

On Mon, Mar 18, 2013 at 12:04 PM, Mark Miller  wrote:

> Hmm…
>
> Sounds like it's a defensive mechanism we have where a leader will check
> it's own state about whether it thinks it's the leader with the zk info. In
> this case it's own state is not convinced of it's leadership. That's just a
> volatile boolean that gets flipped on when elected.
>
> What do the election nodes in ZooKeeper say? Who do they think the leader
> is?
>
> Something is off, but I'm kind of surprised restarting the leader doesn't
> fix it. Someone else should register as the leader or the restarted node
> should reclaim it's spot.
>
> I have no idea if this is solved in 4.2 or not since I don't really know
> what's happened, but I'd love to get to the bottom of it.
>
> After setting the leader volatile boolean to true, the only way it goes
> false other than restart is session expiration. In that case we do flip to
> false - but session expiration should also cause the leader node to drop…
>
>
> - Mark
>


Re: structure of solr index

2013-03-18 Thread Jack Krupansky
Certainly if you are actually going to reference stored values they will add 
on to the TOTAL time and memory, but still have zero impact on the actual 
search time or memory for search. Searching for documents and returning 
results are two separate steps. Highlighting and faceting are other separate 
steps that can add significantly to the TOTAL request processing time, but 
still have ZERO impact on the actual search time or memory for search 
itself.


Add &debugQuery=true to your query request and there will be a "timing" 
section that breaks down some of the major steps. The QueryComponent is the 
actual search time.


Just to be clear, QTime is neither the actual search time nor the total 
request processing time, but somewhere in the middle. QTime will include 
actual search time (the QueryComponent in the timing section), highlighting, 
faceting etc., but not include the time to output the actual response or 
transmit it over the network or parse it on the receiving end. Nor does 
QTime include the time it takes to format the original request, send it to 
Solr, or for Solr to parse the raw Solr request.


Whether it is better to simply return document identifiers and get the 
stored values elsewhere depends completely on the specific nature of the 
application. But, Solr is optimized around the application having complete 
control over what subset of fields should be returned on any given query. 
So, Solr gives you the best of both worlds.
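For example, a request with fl=id returns only the unique key field, so no
other stored values are read at all; fl=id,title reads just those two.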


You could also look at a commercial product like DataStax Enterprise (DSE) 
which does in fact combine Cassandra for storing values and Solr for 
Indexing. It does it transparently so that the app issues fairly standard 
Solr queries, but the stored values come from Cassandra under the hood 
rather than from Lucene even though Lucene's indexing is still used for 
search on all fields. Under the hood, DSE is reading only the document 
unique key field value from the index and then using that key to access the 
field/column values from Cassandra and then merging them into a standard 
Solr response.


See:
http://www.datastax.com/

-- Jack Krupansky





Re: Replica is unable to recover because leader doesn't think it is the leader (Solr 4.1)

2013-03-18 Thread Timothy Potter
Hi Mark,

I figured out what got the cluster into this bad state. I did a rolling
restart, and one of the JVM processes wasn't killed off before I restarted
it, i.e. there were two Solr JVM processes running for the same shard.
(Perhaps some things happen in Solr before Jetty fails to bind to the port
already in use?) Bottom line: make sure a process is dead before you restart
it. Thanks for the help.

Tim

On Mon, Mar 18, 2013 at 12:27 PM, Timothy Potter wrote:

> Hi Mark,
>
> Thanks for responding.
>
> Looking under /collections/solr_signal/leader_elect/shard5/election/ there
> are 2 nodes:
>
> 161276082334072879-ADDR1:8983_solr_solr_signal-n_53 - Mon Mar 18
> 17:36:41 UTC 2013
> 161276082334072880-ADDR2:8983_solr_solr_signal-n_56 - Mon Mar 18
> 17:48:22 UTC 2013
>
> So it looks like the election entry for ADDR2 (the node that cannot
> recover) is later than the one for ADDR1 (the node still online and
> serving requests).
>
> Could I just delete that newer node from ZK?
>
> Cheers,
> Tim


Re: Incorrect snippets using FastVectorHighlighter

2013-03-18 Thread Jochen Just

So just to be clear:
there is no possibility to highlight results if I use a variable gram size.
Neither the original highlighter nor FVH does the job.
Or am I missing something?
Btw, does any documentation exist on how the FVH works?

Jochen
Am 18.03.2013 15:00, schrieb Koji Sekiguchi:
> Hi Jochen,
> 
> There is a restriction in FVH. FVH cannot deal with variable gram size. That 
> is, minGramSize == maxGramSize in your NGramFilterFactory setting.
> 
> koji


--
Jochen Just   Fon:   (++49) 711/28 07 57-193
avono AG  Mobil: (++49) 172/73 85 387
Breite Straße 2   Mail:  jochen.j...@avono.de
70173 Stuttgart   WWW:   http://www.avono.de


RE: Query.toString printing binary in the output...

2013-03-18 Thread Andrew Lundgren
I am sorry, I don't follow what you mean by debug=query.  Can you elaborate on 
that a bit?

Thanks!

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Sunday, March 17, 2013 8:09 AM
To: solr-user@lucene.apache.org
Subject: Re: Query.toString printing binary in the output...

Hmmm, without looking at the code, somehow when you specify debug=query you get 
readable results, maybe that code would be a place to start?

And are you looking for the parsed output? Otherwise you could print the 
original query.

Not much help
Erick


On Fri, Mar 15, 2013 at 3:24 PM, Andrew Lundgren
wrote:

> We use the toString call on the query in our logs.  For some numeric 
> types, the encoded form of the number is being printed instead of the 
> readable form.
>
> This makes tail and some other tools very unhappy...
>
> Here is a partial example of a query.toString() that would have had 
> binary in it.  As a short term work around I replaced all 
> non-printable characters in the string with an '_'.
>
> (collection_id:`__z_[^0.027 collection_id:`__nB+^0.026
> collection_id:`__Zl_^0.025 collection_id:`__i49^0.024
> collection_id:`__Pq%^0.023 collection_id:`__VCS^0.022
> collection_id:`__WbH^0.021 collection_id:`__Yu_^0.02
> collection_id:`__UF&^0.019 collection_id:`__I2g^0.018
> collection_id:`__PP_^0.01699 collection_id:`__Ysv^0.01599
> collection_id:`__Oe_^0.01499 collection_id:`__Ysw^0.01399
> collection_id:`__Wi_^0.01298 collection_id:`__fLi^0.01198
> collection_id:`__XRk^0.01098 collection_id:`__Uz[^0.00998
> collection_id:`__SE_^0.00898 collection_id:`__Ysx^0.00798
> collection_id:`__Ysh^0.006974 collection_id:`__fLh^0.005973 
> collection_id:`__f _^0.00497 collection_id:`__`^C^0.00397
> collection_id:`__fKM^0.00297 collection_id:`__Szo^0.00197 
> collection_id:`__f ]^9.7E-4)
>
> But, as you can see, that is less than useful...
>
> I spent some time looking at the source and found that Term does not 
> contain the type of the embedded data.  Any possible solutions to this 
> short of walking the query and getting the type of each field from the 
> schema and creating my own print function?
>
> Thanks!
>
> --
> Andrew
>
>
>
>
>
>





Re: Group By and Sum

2013-03-18 Thread Alan Woodward
Hi Adam,

Have a look at the stats component: http://wiki.apache.org/solr/StatsComponent. 
 In your case, I think you'd need to add an extra field for your month, and 
then run a query filtered by your date range with stats.field=NetSales, 
stats.field=TransCount, and stats.facet=month.
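Something like this, roughly (an untested sketch - "Month" stands for 
whatever you call the extra indexed month field):

?q=*:*&rows=0
&fq=BusinessDateTime:[2012-04-01T00:00:00Z TO 2013-04-01T00:00:00Z]
&stats=true
&stats.field=NetSales
&stats.field=TransCount
&stats.facet=Month

The stats block in the response then carries the sum (plus min/max/mean and 
so on) of both fields, broken down by month.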

Make sure you use Solr 4.2 for this, by the way, as it's massively faster - 
I've found stats queries over ~500,000 documents dropping from 60 seconds to 2 
seconds with an upgrade from 4.0 to 4.2.

Alan Woodward
www.flax.co.uk


On 18 Mar 2013, at 16:48, Adam Harris wrote:

> Hello All,
> 
> Pretty stuck here and I am hoping you might be the person to help me out. I 
> am working with SOLR and JSONiq, which are totally new to me, and doing even 
> the simplest of things is just escaping me. I know SQL pretty well, however 
> this simple requirement seems to escape me. I'll jump right into it.
> 
> Here is the schema of my Core:
> 
> [schema XML stripped by the mail archive; the fields include
> BusinessDateTime, NetSales and TransCount]
> 
> I need to group by the month of BusinessDateTime and sum up NetSales and 
> TransCount for a given date range. Now if this were SQL I would just write:
> 
> 
> SELECT sum(TransCount), sum(NetSales)
> 
> FROM Core
> 
> WHERE BusinessDateTime BETWEEN '2012/04/01' AND '2013/04/01'
> 
> GROUP BY MONTH(BusinessDateTime)
> 
> But of course nothing is this simple with SOLR and/or JSONiq. I have tried 
> messing around with Facet and Group but they never seem to work the way I 
> want them to. For example, here is a query I am currently playing with:
> 
> 
> ?wt=json
> 
> &indent=true
> 
> &q=*:*
> 
> &rows=0
> 
> &facet=true
> 
> &facet.date=BusinessDateTime
> 
> &facet.date.start=2012-02-01T00:00:01Z
> 
> &facet.date.end=2013-02-01T23:59:59Z
> 
> &facet.date.gap=%2B1MONTH
> 
> &group=true
> 
> &group.field=BusinessDateTime
> 
> &group.facet=true
> 
> &group.field=NetSales
> 
> Now the facet is working properly; however, it is returning the count of the 
> documents, whereas I need the sum of the NetSales and TransCount fields 
> instead.
> 
> Any help or suggestions would be greatly appreciated.
> 
> Thanks,
> Adam



Re: Making tika process mail attachments eludes me

2013-03-18 Thread Marcos Garcia
Hi Leif

I've had the same problem. I tried with 4.2.0 as well, on both Fedora 17 and 
CentOS 6, using Java 6 and Java 7 (OpenJDK and Oracle/Sun as well). I could 
NEVER use example-DIH against a mailbox whose mails have attachments. Mails 
without them worked, even HTML ones, but as soon as a mail included at least 
2 MIME parts (body + attachment), it disappeared from the indexing.

So I decided to put some traces in the code, and I found out that the trace 
"isMimeType #2" is never shown. After modifying the code, I am sure that for 
every mail with an attachment I send, the call "part.getContent()" returns 
null, hence the unexpected result.

#FILE: solr-4.2.0/solr/contrib/dataimporthandler-extras/src/java/org/apache/solr/handler/dataimport/MailEntityProcessor.java

  public void addPartToDocument(Part part, Map<String, Object> row, boolean outerMost) throws Exception {
    LOG.info("Inside addPartToDocument start");
    if (part instanceof Message) {
      LOG.info("Inside addPartToDocument.part instanceof message");
      addEnvelopToDocument(part, row);
    }

    String ct = part.getContentType();
    ContentType ctype = new ContentType(ct);
    if (part.isMimeType("multipart/*")) {
      LOG.info("Inside addPartToDocument.isMimeType #1 " + ct);
      if (part.getContent() != null) {
        Multipart mp = (Multipart) part.getContent();
        LOG.info("Inside addPartToDocument.isMimeType #2");
        int count = mp.getCount();
        LOG.info("Inside addPartToDocument.isMimeType #3 count is " + String.valueOf(count));
        if (part.isMimeType("multipart/alternative"))
          count = 1;
        for (int i = 0; i < count; i++) {
          LOG.info("Inside addPartToDocument.isMimeType.for()");
          addPartToDocument(mp.getBodyPart(i), row, false);
        }
      }
    } else if (part.isMimeType("message/rfc822")) {
      addPartToDocument((Part) part.getContent(), row, false);
    } else {
      LOG.info("Inside addPartToDocument.ELSE #1");
      String disp = part.getDisposition();
      if (!processAttachment || (disp != null && disp.equalsIgnoreCase(Part.ATTACHMENT))) return;
      LOG.info("Inside addPartToDocument.ELSE #2");
      InputStream is = part.getInputStream();
      String fileName = part.getFileName();
      Metadata md = new Metadata();
      md.set(HttpHeaders.CONTENT_TYPE, ctype.getBaseType().toLowerCase(Locale.ROOT));
      md.set(TikaMetadataKeys.RESOURCE_NAME_KEY, fileName);
      String content = tika.parseToString(is, md);
      LOG.info("Inside addPartToDocument.ELSE #3");
      if (disp != null && disp.equalsIgnoreCase(Part.ATTACHMENT)) {
        LOG.info("Inside addPartToDocument.ELSE #4aaa");
        if (row.get(ATTACHMENT) == null)
          row.put(ATTACHMENT, new ArrayList<String>());
        List<String> contents = (List<String>) row.get(ATTACHMENT);
        contents.add(content);
        row.put(ATTACHMENT, contents);
        if (row.get(ATTACHMENT_NAMES) == null)
          row.put(ATTACHMENT_NAMES, new ArrayList<String>());
        List<String> names = (List<String>) row.get(ATTACHMENT_NAMES);
        names.add(fileName);
        row.put(ATTACHMENT_NAMES, names);
      } else {
        LOG.info("Inside addPartToDocument.ELSE #4bis");
        if (row.get(CONTENT) == null)
          row.put(CONTENT, new ArrayList<String>());
        List<String> contents = (List<String>) row.get(CONTENT);
        contents.add(content);
        row.put(CONTENT, contents);
      }
    }
  }

My solrconfig is the same as included in the example-DIH folder.

I've found on Google that javamail+activation can cause problems if the 
versions included in the application don't match the ones that are now 
included in the JRE. I tried removing them, putting in newer versions, etc., 
but no result.
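
To take DIH out of the equation, a standalone probe like the following (just 
a sketch - it reads a raw RFC822 file given as the first argument) shows 
whether javamail itself returns null for the same message:

import java.io.FileInputStream;
import java.util.Properties;
import javax.mail.Multipart;
import javax.mail.Session;
import javax.mail.internet.MimeMessage;

public class MimeProbe {
  public static void main(String[] args) throws Exception {
    Session session = Session.getDefaultInstance(new Properties());
    // parse the raw message straight from disk
    MimeMessage msg = new MimeMessage(session, new FileInputStream(args[0]));
    System.out.println("content-type: " + msg.getContentType());
    Object content = msg.getContent(); // null here reproduces the DIH symptom
    System.out.println("content: "
        + (content == null ? "null" : content.getClass().getName()));
    if (content instanceof Multipart) {
      System.out.println("parts: " + ((Multipart) content).getCount());
    }
  }
}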

I believe that the handling of the multipart MIME lacks some error checking, and 
it is probably related to the content outside the MIME boundaries (in my 
example, the text "This is a multi-part message in MIME format."):

I really hope that some SOLR developer can have a look; we cannot be the only 
ones having this problem. And I've spent almost twenty hours debugging this.

Regards

PS: Example of mail that doesn't get processed:
Return-Path: marcos.gar...@savoirfairelinux.com
Received: from mail.savoirfairelinux.com (LHLO mail.savoirfairelinux.com)
 (192.168.52.6) by mail.savoirfairelinux.com with LMTP; Mon, 18 Mar 2013
 12:10:22 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
by mail.savoirfairelinux.com (Postfix) with ESMTP id 39CA425819D
for ; Mon, 18 Mar 2013 12:10:22 -0400 
(EDT)
X-Virus-Scanned: amavisd-new at mail.savoirfairelinux.com
X-Spam-Flag: NO
X-Spam-Score: -2.9
X-Spam-Level: 
X-Spam-Status: No, score=-2.9 tagged_above=-10 required=4.4
tests=[ALL_TRUSTED=-1, BAYES_00=-1.9] autolearn=ham
Received: from mail.savoirfairelinux.com ([127.0.0.1])
by localhost (mail.savoirfairelinux.com [127.0.0.1]) (amavisd-new, port 
10024)
with ESMTP id fxeONJBdw8JA for ;
Mon, 18 Mar 2013 12:10:21 -04

Re: Search on final value in multi-valued field

2013-03-18 Thread Alexandre Rafalovitch
So, if you really want that, you need to clone the field and keep only the
final value in the clone. In 4.1, there are helpers for that:
http://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/update/processor/LastFieldValueUpdateProcessorFactory.html

You don't have to store the copied value, just index it.
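
A rough solrconfig.xml sketch (the field names are made up - wire the chain 
into whichever update handler you use):

<updateRequestProcessorChain name="last-value">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">categories</str>
    <str name="dest">categories_last</str>
  </processor>
  <processor class="solr.LastFieldValueUpdateProcessorFactory">
    <str name="fieldName">categories_last</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Then search on categories_last instead of the original field.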

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Mon, Mar 18, 2013 at 1:46 PM, Jack Krupansky wrote:

> Yes, order is maintained, but search is simply whether any of the multiple
> values matches.
>
> -- Jack Krupansky
>
> -Original Message- From: Annette Newton
> Sent: Monday, March 18, 2013 1:40 PM
> To: solr-user@lucene.apache.org
> Subject: Search on final value in multi-valued field
>
>
> Are multi-valued fields ordered and if so is it possible to search on the
> final value only?
>
> --
>
> Annette Newton
>
> Database Administrator
>
> ServiceTick Ltd
>
>
>
> T:+44(0)1603 618326
>
>
>
> Seebohm House, 2-4 Queen Street, Norwich, England NR2 4SQ
>
> www.servicetick.com
>
> *www.sessioncam.com*
>
> --
> *This message is confidential and is intended to be read solely by the
> addressee. The contents should not be disclosed to any other person or
> copies taken unless authorised to do so. If you are not the intended
> recipient, please notify the sender and permanently delete this message. As
> Internet communications are not secure ServiceTick accepts neither legal
> responsibility for the contents of this message nor responsibility for any
> change made to this message after it was forwarded by the original
> author.*
>


strange behaviour of wordbreak spellchecker in solr cloud

2013-03-18 Thread alxsss
Hello,

I am trying to use the wordbreak spellchecker in solr-4.2 with the cloud 
feature. We have two servers, with one shard on each of them.

curl 'server1:8983/solr/test/testhandler?q=paulusoles&indent=true&rows=10'
curl 'server2:8983/solr/test/testhandler?q=paulusoles&indent=true&rows=10'

Neither returns any spellchecker results. However, if I specify 
distrib=false, only one of them returns spellchecker results.

curl 
'server1:8983/solr/test/testhandler?q=paulusoles&indent=true&rows=10&distrib=false'

no spellchecker results 

curl 
'server2:8983/solr/test/testhandler?q=paulusoles&indent=true&rows=10&distrib=false'
returns spellchecker results.


My testhandler and select handlers are as follows




edismax
explicit
0.01
host^30  content^0.5 title^1.2 
site^25 content^10 title^22
url,id,title

3<-1 5<-3 6<90%
1

true
content
regex
165
default


direct
wordbreak
on
true
false
2




 spellcheck





  

 
   explicit
   10
   
 







   
 spellcheck





Is this a bug, or does something else have to be done?


Thanks.
Alex.



Help getting a document by unique ID

2013-03-18 Thread Brian Hurt
So here's the problem I'm trying to solve: in my use case, all my
documents have a unique id associated with them (a string), and I very
often need to get them by id.  Currently I'm doing a search on id, and
this takes long enough that it's killing my performance.  Now, it looks
like there is a GET call in the REST interface which does exactly what
I need, but I'm using the solrj interface.

So my two questions are:

1. Is GET the right function I should be using?  Or should I be using
some other function, or storing copies of the documents somewhere
else entirely for fast id-based retrieval?

2. How do I call GET with solrj?  I've googled for how to do this, and
haven't come up with anything.

Thanks.

Brian


Shingles Filter Query time behaviour

2013-03-18 Thread Catala, Francois
Hello,

I am trying to have the input "darkknight" match documents containing either 
"dark knight" or "darkknight".
The reverse should also work ("dark knight" matching both "dark knight" and 
"darkknight"), but it doesn't. Does anyone know why?


When I run the following query I get the expected response with the two 
documents matched


<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="fl">name</str>
      <str name="indent">true</str>
      <str name="q">name:darkknight</str>
      <str name="wt">xml</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc><str name="name">Batman, the darkknight Rises</str></doc>
    <doc><str name="name">Batman, the dark knight Rises</str></doc>
  </result>
</response>



HOWEVER, when I run the same query looking for "dark knight" as two words, I 
get only 1 document matched, as the response shows:


<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="fl">name</str>
      <str name="indent">true</str>
      <str name="q">name:dark knight</str>
      <str name="wt">xml</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc><str name="name">Batman, the dark knight Rises</str></doc>
  </result>
</response>


I have these documents as input :


<doc>
  <field name="id">bat1</field>
  <field name="name">Batman, the dark knight Rises</field>
</doc>
<doc>
  <field name="id">bat2</field>
  <field name="name">Batman, the darkknight Rises</field>
</doc>

And I defined this analyser, but its XML was stripped by the mail archive.
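For reference, a typical shingle setup for this kind of matching looks 
roughly like the following (a sketch reconstructed for illustration, not 
necessarily the configuration that was lost above):

<fieldType name="text_shingle" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- tokenSeparator="" glues adjacent words, so "dark knight"
         also emits the single token "darkknight" -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
            outputUnigrams="true" tokenSeparator=""/>
  </analyzer>
</fieldType>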


Re: how to deploy customization in solr that requires dependency

2013-03-18 Thread Shawn Heisey

On 3/18/2013 11:47 AM, Gian Maria Ricci wrote:

I want to deploy a custom filter developed in Java to Solr 4. My problem is
that it needs to access SQL Server, so it depends on sqljdbc4.jar, but
I got a java.lang.ClassNotFoundException:
com.microsoft.sqlserver.jdbc.SQLServerDriver


Solr has a property "solr.solr.home" (which defaults to solr in the 
current working directory) ... this is the directory where solr.xml 
lives.  By default, Solr is supposed to look for a lib directory in this 
location, from which jar files can be loaded by all cores.  I make this 
explicit in my config with a 'sharedLib="lib"' attribute on the solr tag 
in solr.xml, but it's my understanding that this config is not strictly 
required.
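
In solr.xml that attribute looks like this (keep whatever other attributes 
your solr tag already has):

<solr persistent="true" sharedLib="lib">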


I put all jar files required by Solr here.  My solr.solr.home is set to 
/index/solr4:


ncindex@bigindy5 ~ $ ls /index/solr4/lib
icu4j-49.1.jar
lucene-analyzers-icu-4.3-SNAPSHOT.jar
mysql-connector-java-5.1.22-bin.jar
solr-dataimporthandler-4.3-SNAPSHOT.jar
solr-dataimporthandler-extras-4.3-SNAPSHOT.jar


Thanks,
Shawn



Re: Help getting a document by unique ID

2013-03-18 Thread Jack Krupansky
Hmmm... if querying by your unique key field is killing your performance, maybe 
you have some larger problem to address. How bad is it? Are you using the 
string field type? How long are your ids?


The only thing the real-time GET API gives you is more immediate access to 
recently added, uncommitted data. Accessing older, committed data will be no 
faster. But if accessing that recent data is what you are after, real-time 
GET may do the trick.


I don't recall seeing changes to add it to SolrJ.

Realtime Get:
http://searchhub.org/2011/09/07/realtime-get/
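
That said, you can hit the /get handler from SolrJ today with a generic 
query request - a rough sketch, assuming the stock /get handler from the 
example solrconfig (the URL and id are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RealtimeGet {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery();
    q.setRequestHandler("/get"); // a handler name starting with '/' becomes the request path
    q.set("id", "mydoc-123");
    QueryResponse rsp = server.query(q);
    System.out.println(rsp.getResponse().get("doc")); // /get returns a "doc" entry
    server.shutdown();
  }
}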

-- Jack Krupansky

-Original Message- 
From: Brian Hurt

Sent: Monday, March 18, 2013 6:08 PM
To: solr-user@lucene.apache.org
Subject: Help getting a document by unique ID

So here's the problem I'm trying to solve: in my use case, all my
documents have a unique id associated with them (a string), and I very
often need to get them by id.  Currently I'm doing a search on id, and
this takes long enough that it's killing my performance.  Now, it looks
like there is a GET call in the REST interface which does exactly what
I need, but I'm using the solrj interface.

So my two questions are:

1. Is GET the right function I should be using?  Or should I be using
some other function, or storing copies of the documents somewhere
else entirely for fast id-based retrieval?

2. How do I call GET with solrj?  I've googled for how to do this, and
haven't come up with anything.

Thanks.

Brian 



Re: 4.0 hanging on startup on Windows after Control-C

2013-03-18 Thread xavier jmlucjav
Hi Shawn,

I am using DIH with a commit at the end... I'll investigate further to see if
this is what is happening and will report back; I'll also check 4.2 (which I
had to do anyway...).
thanks for your input
xavier


On Mon, Mar 18, 2013 at 6:12 PM, Shawn Heisey  wrote:

> On 3/17/2013 11:51 AM, xavier jmlucjav wrote:
>
>> Hi,
>>
>> I have an index where, if I kill solr via Control-C, it consistently hangs
>> next time I start it. Admin does not show cores, and searches never
>> return.
>> If I delete the index contents and I restart again all is ok. I am on
>> windows 7, jdk1.7 and Solr4.0.
>> Is this a known issue? I looked in jira but found nothing.
>>
>
> I scanned your thread dump.  Nothing jumped out at me, but given my
> inexperience with such things, I'm not surprised by that.
>
> Have you tried 4.1 or 4.2 yet to see if the problem persists?  4.0 is no
> longer the new hotness.
>
> Below I will discuss the culprit that springs to mind, though I don't know
> whether it's what you are actually hitting.
>
> One thing that can make Solr take a really long time to start up is huge
> transaction logs.  Transaction logs must be replayed when Solr starts, and
> if they are huge, it can take a really long time.
>
> Do you have tlog directories in your cores (in the data dir, next to the
> index directory), and if you do, how much disk space do they use?  The
> example config in 4.x has updateLog turned on.
>
> There are two common situations that can lead to huge transaction logs.
>  One is exclusively using soft commits when indexing, the other is running
> a very large import with the dataimport handler and not committing until
> the very end.
>
> AutoCommit with openSearcher=false is a good solution to both of these
> situations.  As long as you use openSearcher=false, it will not change what
> documents are visible.  AutoCommit does a regular "hard" commit every X new
> documents or every Y milliseconds.  A hard commit flushes index data to
> disk and starts a new transaction log.  Solr will only keep a few
> transaction logs around, so frequently building new ones keeps their size
> down.  When you restart Solr, you don't need to wait for a long time while
> it replays them.
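>
> For reference, that looks like this in solrconfig.xml (the numbers are just
> examples - tune them to your indexing rate):
>
>   <updateHandler class="solr.DirectUpdateHandler2">
>     <updateLog/>
>     <autoCommit>
>       <maxDocs>25000</maxDocs>
>       <maxTime>300000</maxTime>      <!-- five minutes -->
>       <openSearcher>false</openSearcher>
>     </autoCommit>
>   </updateHandler>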
>
> Thanks,
> Shawn
>
>


Re: Is Solr more CPU bound or IO bound?

2013-03-18 Thread Erick Erickson
And just to make it worse, I've seen lots of cases where the correct answer
is "neither, performance is constrained by memory" ...

Erick


On Sun, Mar 17, 2013 at 10:44 PM, David Parks wrote:

> Thank you, Manu, for that excellent discussion on the topic, I could have
> been more detailed about my use case.
>
> We'll be indexing off-of the main production servers (either on a master,
> or
> in Hadoop, we're yet to build out that piece of the puzzle). We don't store
> documents at all, we only store the index data and return a document ID,
> each document is maybe 1k of text, small.  We do have a few "interesting"
> queries in which we do some grouping.
>
> We currently index 100GB of input data, that'll grow 2x or 3x in the near
> future.
>
> So based on your experience, it seems likely that we'll be CPU bound (heavy
> queries against a static index updated nightly from the master), thus
> nullifying the advantage of dual-purposing a box with another CPU bound
> app.
>
> Very useful discussion, I'll get proper load tests done in time but this
> helps direct my thinking now.
>
> David
>
>
>
> -Original Message-
> From: idokis...@gmail.com [mailto:idokis...@gmail.com] On Behalf Of Manuel
> Le Normand
> Sent: Monday, March 18, 2013 9:57 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is Solr more CPU bound or IO bound?
>
> Your question is typical use-case dependent; the bottleneck will change
> from user to user.
>
> These are two main issues that will affect the answer:
> 1. How do you index: what is your indexing rate (how many docs a day)? how
> big is a typical document? how many documents do you plan on indexing in
> total? do you store fields? calculate their term vectors?
> 2. How does your retrieval process look: What's the expected query rate? Are
> there common queries (taking advantage of the cache)? Complexity of queries
> (faceted / highlighted / filtered/ how many conditions, NRT)? Do you plan
> to
> retrieve stored fields or only id's?
>
> After answering all that there's an iterative game between hardware
> configuration and software configuration (how you split your shards, use
> your cache, tune your merges and flushes etc.) that would also affect the
> IO / CPU bound answer.
>
> In my use-case, for example, the indexing part is IO bound, but as my
> indexing rate is much below the rate my machine could initially provide, it
> didn't affect my hardware spec.
> After fine tuning my configuration I discovered my retrieval process was
> CPU bound and was directly affecting my avg response time, while the IO
> rate in cache usage was quite low.
>
> Try describing your use case in more details with the above questions so
> we'd be able to give you guidelines.
>
> Best,
> Manu
>
>
> On Mon, Mar 18, 2013 at 3:55 AM, David Parks 
> wrote:
>
> > I'm spec'ing out some hardware for a first go at our production Solr
> > instance, but I haven't spent enough time loadtesting it yet.
> >
> >
> >
> > What I want to ask is how IO intensive Solr is vs. CPU intensive,
> > typically.
> >
> >
> >
> > Specifically I'm considering whether to dual-purpose the Solr servers
> > to run Solr and another CPU-only application we have. I know Solr uses
> > a fair amount of CPU, but if it also is very disk intensive it might
> > be a net benefit to have more instances running Solr and share the CPU
> > resources with the other app than to run Solr separate from the other
> > CPU app that wouldn't otherwise use the disk.
> >
> >
> >
> > Thoughts on this?
> >
> >
> >
> > Thanks,
> >
> > David
> >
> >
> >
> >
>
>


Re: Incorrect snippets using FastVectorHighlighter

2013-03-18 Thread Koji Sekiguchi

So just to be clear:
There is no way to highlight results if I use a variable gram size.
Neither the original highlighter nor the FVH does the job.
Or am I missing something?


I don't know whether the latest original highlighter still has such a
restriction, but when FVH came in 2.9, the original highlighter couldn't
deal with an n-gram field if n > 1, because the (k)-th term's end offset can
be larger than the (k+1)-th term's start offset.


Btw, does any documentation exist on how the FVH works?


See package summary:

http://lucene.apache.org/core/4_2_0/highlighter/org/apache/lucene/search/vectorhighlight/package-summary.html

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: Query.toString printing binary in the output...

2013-03-18 Thread Erick Erickson
If you simply attach &debug=all to your URL, you should see the query come
back in your response, XML, JSON, whatever. If that also shows bizarre
characters, then that will give you some idea whether it's in Solr or not.
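
For example (host and core are placeholders):

curl 'http://localhost:8983/solr/collection1/select?q=foo&debug=all&wt=json'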

But you haven't given us much info about how/where you call toString. You
may be getting into trouble with character sets (although I'd find that
quite odd, but it's a possibility).

What I'm really finding confusing is that you're mentioning Term alongside
query.toString() (at least that's what I think you're saying), which has
nothing at all to do with Terms, it's just the query string passed in. So
I'm really puzzled as to what you're doing to get this kind of output, it
almost looks like you're trying to print out the _results_ of a query, not
the query.

So some clarification would be helpful...

Best
Erick


On Mon, Mar 18, 2013 at 12:01 PM, Andrew Lundgren  wrote:

> I am sorry, I don't follow what you mean by debug=query.  Can you
> elaborate on that a bit?
>
> Thanks!
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Sunday, March 17, 2013 8:09 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Query.toString printing binary in the output...
>
> Hmmm, without looking at the code, somehow when you specify debug=query
> you get readable results, maybe that code would be a place to start?
>
> And are you looking for the parsed output? Otherwise you could print the
> original query.
>
> Not much help
> Erick
>
>
> On Fri, Mar 15, 2013 at 3:24 PM, Andrew Lundgren
> wrote:
>
> > We use the toString call on the query in our logs.  For some numeric
> > types, the encoded form of the number is being printed instead of the
> > readable form.
> >
> > This makes tail and some other tools very unhappy...
> >
> > Here is a partial example of a query.toString() that would have had
> > binary in it.  As a short term work around I replaced all
> > non-printable characters in the string with an '_'.
> >
> > (collection_id:`__z_[^0.027 collection_id:`__nB+^0.026
> > collection_id:`__Zl_^0.025 collection_id:`__i49^0.024
> > collection_id:`__Pq%^0.023 collection_id:`__VCS^0.022
> > collection_id:`__WbH^0.021 collection_id:`__Yu_^0.02
> > collection_id:`__UF&^0.019 collection_id:`__I2g^0.018
> > collection_id:`__PP_^0.01699 collection_id:`__Ysv^0.01599
> > collection_id:`__Oe_^0.01499 collection_id:`__Ysw^0.01399
> > collection_id:`__Wi_^0.01298 collection_id:`__fLi^0.01198
> > collection_id:`__XRk^0.01098 collection_id:`__Uz[^0.00998
> > collection_id:`__SE_^0.00898 collection_id:`__Ysx^0.00798
> > collection_id:`__Ysh^0.006974 collection_id:`__fLh^0.005973
> > collection_id:`__f _^0.00497 collection_id:`__`^C^0.00397
> > collection_id:`__fKM^0.00297 collection_id:`__Szo^0.00197
> > collection_id:`__f ]^9.7E-4)
> >
> > But, as you can see, that is less than useful...
> >
> > I spent some time looking at the source and found that Term does not
> > contain the type of the embedded data.  Any possible solutions to this
> > short of walking the query and getting the type of each field from the
> > schema and creating my own print function?
> >
> > Thanks!
> >
> > --
> > Andrew
> >
> >
> >
> >
> >
> >
>
>
>
>


Re: Group By and Sum

2013-03-18 Thread Erick Erickson
Second Walter's comment. Make really, _really_ sure that "the powers that
be" recognize that they're asking for something unreasonable and it'll cost
them dearly to get it.

Best
Erick


On Mon, Mar 18, 2013 at 12:04 PM, Alan Woodward  wrote:

> Hi Adam,
>
> Have a look at the stats component:
> http://wiki.apache.org/solr/StatsComponent.  In your case, I think you'd
> need to add an extra field for your month, and then run a query filtered by
> your date range with stats.field=NetSales, stats.field=TransCount, and
> stats.facet=month.
>
> Make sure you use Solr 4.2 for this, by the way, as it's massively faster
> - I've found stats queries over ~500,000 documents dropping from 60 seconds
> to 2 seconds with an upgrade from 4.0 to 4.2.
>
> Alan Woodward
> www.flax.co.uk
>
>
> On 18 Mar 2013, at 16:48, Adam Harris wrote:
>
> > Hello All,
> >
> > Pretty stuck here and I am hoping you might be the person to help me
> out. I am working with SOLR and JSONiq which are totally new to me and
> doing even the simplest of things is just escaping me. I know SQL pretty
> well, however this simple requirement seems to escape me. I'll jump right into
> it.
> >
> > Here is the schema of my Core:
> >
> > [schema XML stripped by the mail archive; the fields include
> > BusinessDateTime, NetSales and TransCount]
> >
> > I need to group by the month of BusinessDateTime and sum up NetSales and
> TransCount for a given date range. Now if this were SQL I would just write:
> >
> >
> > SELECT sum(TransCount), sum(NetSales)
> >
> > FROM Core
> >
> > WHERE BusinessDateTime BETWEEN '2012/04/01' AND '2013/04/01'
> >
> > GROUP BY MONTH(BusinessDateTime)
> >
> > But of course nothing is this simple with SOLR and/or JSONiq. I have
> tried messing around with Facet and Group but they never seem to work the
> way I want them to. For example, here is a query I am currently playing with:
> >
> >
> > ?wt=json
> >
> > &indent=true
> >
> > &q=*:*
> >
> > &rows=0
> >
> > &facet=true
> >
> > &facet.date=BusinessDateTime
> >
> > &facet.date.start=2012-02-01T00:00:01Z
> >
> > &facet.date.end=2013-02-01T23:59:59Z
> >
> > &facet.date.gap=%2B1MONTH
> >
> > &group=true
> >
> > &group.field=BusinessDateTime
> >
> > &group.facet=true
> >
> > &group.field=NetSales
> >
> > Now the facet is working properly; however, it is returning the count of
> the documents, whereas I need the sum of the NetSales and TransCount
> fields instead.
> >
> > Any help or suggestions would be greatly appreciated.
> >
> > Thanks,
> > Adam
>
>


Re: DIH silently ignoring a record

2013-03-18 Thread Shalin Shekhar Mangar
That does sound perplexing.

Justin, can you tell us which field in the query is your record id? What is
the record id's type in database and in solr schema? What is your unique
key and its type in solr schema?


On Tue, Mar 19, 2013 at 5:19 AM, Justin L.  wrote:

> Every time I do an import, DataImportHandler is not importing 1 row from my
> database.
>
> I have 3 entities each defined with a single query. I have confirmed, by
> looking at totals from solr as well as comparing a "*:*" query to direct db
> queries-- exactly 1 row is missing every time. And it's the same row- the
> first row of one of my entities when sorted by primary key. The other two
> entities are fully imported without trouble.
>
> There are no errors in the log- even when DIH logging is turned up to FINE.
> When I alter the query to retrieve only the mysterious record, it shows up
> as "Fetched: 1 Skipped: 0 Processed: 1". But when I do a query for *:* it
> returns 0 documents.
>
> Ready for a twist? The DIH query for this entity does not have an ORDER BY
> clause- when I add one to sort by primary key DESC it imports all of the
> rows for that entity, including the mysterious record.
>
> Ready to have your mind blown? I am using the alternative method for doing
> delta imports (see query below). When I make clean=false, and update the
> timestamp on the mysterious record- yup- it gets imported properly.
>
>
>
> Because I have the ORDER BY DESC hack, I can get by and live to fight
> another day. But I thought someone might like to know this because I think
> I am hitting a bug in DIH- specifically, something after the querying but
> before the posting to solr. If someone familiar with DIH innards wants to
> suggest where I should look or how to step through it, I'd be willing to
> take a look.
>
> xoxo,
> Justin
>
>
> * Fun facts:
> Solr 4.0
> Oracle 11g
> The mysterious record's id is "01"
> I use field elements to rename the columns rather than in-the-sql aliases
> because of a problem I had with them earlier. But I will try changing that.
>
>
> * Alternative delta import method:
>
> http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
>
>
> * DIH query that should import mysterious record:
>
> select organization_name, organization_id, address
> from organization o
> join rolodex r on r.rolodex_id = o.contact_address_id
> and r.sponsor_address_flag = 'N'
> and r.actv_ind = 'Y'
> where '${dataimporter.request.clean}' = 'true'
> or to_char(o.update_timestamp,'-MM-DD HH24:MI:SS') >
> '${dataimporter.organization.last_index_time
>



-- 
Regards,
Shalin Shekhar Mangar.


Solr Core Creation dynamically

2013-03-18 Thread Ravi_Mandala
Hi,

I am trying to create a new core dynamically (programmatically) in Solr 4.0.
I tried:

http://localhost:7081/apache-solr-4.0.0/admin/cores?action=CREATE&name=coreX&instanceDir=coreX&config=solr-config.xml&schema=schema.xml&dataDir=data

But I am not able to create a core. Is there any way to create a new
core (and its directories)?
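
For reference, the SolrJ equivalent of that HTTP call is roughly the
following (a sketch - note that the instanceDir and its conf files must
already exist on disk before CREATE will succeed):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class CreateCore {
  public static void main(String[] args) throws Exception {
    // point at the Solr root, not at an individual core
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:7081/apache-solr-4.0.0");
    CoreAdminRequest.createCore("coreX", "coreX", server,
        "solr-config.xml", "schema.xml");
    server.shutdown();
  }
}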

Please mail me if you have any ideas.

Thanks,
Ravi.M



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Core-Creation-dynamically-tp4048872.html
Sent from the Solr - User mailing list archive at Nabble.com.


SolrCloud with Zookeeper ensemble : fail to restart master server

2013-03-18 Thread Patrick Mi
Hi there,

I have experienced some problems starting the master server.

Solr4.2 under Tomcat 7 on Centos6.

Configuration : 
3 solr instances running on different machines, one shard, 3 cores, 2
replicas, using the Zookeeper that comes with Solr.

The master server A has the following run option: -Dbootstrap_conf=true
-DzkRun -DnumShards=1, 
The slave servers B and C have : -DzkHost=masterServerIP:2181 

It works well for add/update/delete etc after I start up master and slave
servers in order.

When master A is up, stopping/starting slaves B and C is OK.

While slaves B and C are running I can't restart master A. Only after I
shut down B and C can I start master A.

Is this a feature, a bug, or something I haven't configured properly?

Thanks in advance for your help

Regards,
Patrick



RE: SnapPull failed - SOLR 4.1

2013-03-18 Thread Sandeep Kumar Anumalla
Hi Mark,

I have upgraded to Solr 4.2 and I am still getting this exception.


INFO: removing temporary index download directory files 
NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/data/solr-4.2.0/example/solr/collection1/data/index.20130319101506108
 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@47042c25; 
maxCacheMB=48.0 maxMergeSizeMB=4.0)
Mar 19, 2013 10:19:02 AM org.apache.solr.common.SolrException log
SEVERE: SnapPull failed :org.apache.solr.common.SolrException: Unable to 
download _1l1.fdt completely. Downloaded 0!=256950



Thanks
Sandeep A.

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: 18 March, 2013 10:25 AM
To: solr-user@lucene.apache.org
Subject: Re: SnapPull failed - SOLR 4.1

This is probably related to some Replication bugs that 4.1 had - 4.2 is 
probably your best bet for a fix.

- Mark

On Mar 18, 2013, at 1:48 AM, Sandeep Kumar Anumalla  
wrote:

> SEVERE: SnapPull failed :org.apache.solr.common.SolrException: Unable to 
> download _xv0_Lucene41_0.doc completely. Downloaded 0!=5935
>
>
> We are continuously getting the above exception in our replication (SALVE) 
> machine.
>
> I tried compression option, increased bandwidth between Master and Slave 
> also, still facing the issue
>
> Please let me know how resolve this issue.
>
>
> Thanks & Regards
> Sandeep A
> Ext : 02618-2856
> M : 0502493820
>
>
> 

