indexing java byte code in classes / jars

2015-05-08 Thread Mark
I'm looking to use Solr to search over the byte code in classes and jars.

Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token
Filters for such a task?

Regards

Mark


Re: indexing java byte code in classes / jars

2015-05-08 Thread Mark
To answer why bytecode - the main use case I have is to index as much detail
as possible from jars/classes:

extract class names,
method names
signatures
packages / imports

I am considering using ASM to generate an analysis view of each class.
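
Something along these lines is what I have in mind - an untested sketch,
assuming ASM 5 on the classpath, that just collects the class name and the
method names/descriptors (the class name and field names here are mine,
nothing Solr-specific yet):

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// Collects the class name plus method names/descriptors from one .class stream.
public class ClassIndexer extends ClassVisitor {
    String className;
    final List<String> methods = new ArrayList<String>();

    public ClassIndexer() {
        super(Opcodes.ASM5);
    }

    @Override
    public void visit(int version, int access, String name, String signature,
                      String superName, String[] interfaces) {
        className = name.replace('/', '.');
    }

    @Override
    public MethodVisitor visitMethod(int access, String name, String desc,
                                     String signature, String[] exceptions) {
        // e.g. "parse(Ljava/lang/String;)Lorg/example/Result;"
        methods.add(name + desc);
        return null; // no need to visit the method body
    }

    public static ClassIndexer read(InputStream classBytes) throws IOException {
        ClassIndexer indexer = new ClassIndexer();
        new ClassReader(classBytes).accept(indexer, ClassReader.SKIP_CODE);
        return indexer;
    }
}

Each ClassIndexer result would then be flattened into one Solr document per
class.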

The sort of use cases I have would be method / signature searches.

For example:

1) show any classes with a method named parse*

2) show any classes with a method named parse that passes in a type *json*

...etc

In the past I have written something to reverse out javadocs from just the
Java bytecode; using Solr would make this idea considerably more powerful.

Thanks for the suggestions so far







On 8 May 2015 at 21:19, Erik Hatcher  wrote:

> Oh, and sorry, I omitted a couple of details:
>
> # creating the “java” core/collection
> bin/solr create -c java
>
> # I ran this from my Solr source code checkout, so that
> SolrLogFormatter.class just happened to be handy
>
> Erik
>
>
>
>
> > On May 8, 2015, at 4:11 PM, Erik Hatcher  wrote:
> >
> > What kinds of searches do you want to run?  Are you trying to extract
> class names, method names, and such and make those searchable?   If that’s
> the case, you need some kind of “parser” to reverse engineer that
> information from .class and .jar files before feeding it to Solr, which
> would happen before analysis.   Java itself comes with a javap command that
> can do this; whether this is the “best” way to go for your scenario I don’t
> know, but here’s an interesting example pasted below (using Solr 5.x).
> >
> > —
> > Erik Hatcher, Senior Solutions Architect
> > http://www.lucidworks.com
> >
> >
> > javap
> build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class >
> test.txt
> > bin/post -c java test.txt
> >
> > now search for "coreInfoMap"
> http://localhost:8983/solr/java/browse?q=coreInfoMap
> >
> > I tried to be cleverer and use the stdin option of bin/post, like this:
> > javap
> build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class |
> bin/post -c java -url http://localhost:8983/solr/java/update/extract
> -type text/plain -params "literal.id=SolrLogFormatter" -out yes -d
> > but something isn’t working right with the stdin detection like that (it
> does work to `cat test.txt | bin/post…` though, hmmm)
> >
> > test.txt looks like this, `cat test.txt`:
> > Compiled from "SolrLogFormatter.java"
> > public class org.apache.solr.SolrLogFormatter extends
> java.util.logging.Formatter {
> >  long startTime;
> >  long lastTime;
> >  java.util.Map java.lang.String> methodAlias;
> >  public boolean shorterFormat;
> >  java.util.Map org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap;
> >  public java.util.Map classAliases;
> >  static java.lang.ThreadLocal threadLocal;
> >  public org.apache.solr.SolrLogFormatter();
> >  public void setShorterFormat();
> >  public java.lang.String format(java.util.logging.LogRecord);
> >  public void appendThread(java.lang.StringBuilder,
> java.util.logging.LogRecord);
> >  public java.lang.String _format(java.util.logging.LogRecord);
> >  public java.lang.String getHead(java.util.logging.Handler);
> >  public java.lang.String getTail(java.util.logging.Handler);
> >  public java.lang.String formatMessage(java.util.logging.LogRecord);
> >  public static void main(java.lang.String[]) throws java.lang.Exception;
> >  public static void go() throws java.lang.Exception;
> >  static {};
> > }
> >
> >> On May 8, 2015, at 3:31 PM, Mark  wrote:
> >>
> >> I looking to use Solr search over the byte code in Classes and Jars.
> >>
> >> Does anyone know or have experience of Analyzers, Tokenizers, and Token
> >> Filters for such a task?
> >>
> >> Regards
> >>
> >> Mark
> >
>
>


Re: indexing java byte code in classes / jars

2015-05-08 Thread Mark
https://searchcode.com/

looks really interesting, however I want to extract as many searchable
aspects as possible out of jars sitting on a classpath or under a project
structure...

Really early days so I'm open to any suggestions
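
As a first step I'm thinking of something like this - a rough sketch using
plain java.util.jar (no Solr or ASM wired in yet, and the class/file names are
just illustrative) to enumerate the .class entries that would be fed into the
bytecode parser:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Stream;

// Walks a directory tree and lists every .class entry inside every .jar.
public class JarWalker {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(p -> p.toString().endsWith(".jar"))
                 .forEach(JarWalker::listClasses);
        }
    }

    static void listClasses(Path jarPath) {
        try (JarFile jar = new JarFile(jarPath.toFile())) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (entry.getName().endsWith(".class")) {
                    // jar.getInputStream(entry) is what the bytecode parser would read
                    System.out.println(jarPath + "!" + entry.getName());
                }
            }
        } catch (IOException e) {
            System.err.println("Skipping " + jarPath + ": " + e.getMessage());
        }
    }
}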



On 8 May 2015 at 22:09, Mark  wrote:

> To answer why bytecode - because mostly the use case I have is looking to
> index as much detail from jars/classes.
>
> extract class names,
> method names
> signatures
> packages / imports
>
> I am considering using ASM in order to generate an analysis view of the
> class
>
> The sort of usecases I have would be method / signature searches.
>
> For example;
>
> 1) show any classes with a method named parse*
>
> 2) show any classes with a method named parse that passes in a type *json*
>
> ...etc
>
> In the past I have written something to reverse out javadocs from just
> java bytecode, using solr would move this idea considerably much more
> powerful.
>
> Thanks for the suggestions so far
>
>
>
>
>
>
>
> On 8 May 2015 at 21:19, Erik Hatcher  wrote:
>
>> Oh, and sorry, I omitted a couple of details:
>>
>> # creating the “java” core/collection
>> bin/solr create -c java
>>
>> # I ran this from my Solr source code checkout, so that
>> SolrLogFormatter.class just happened to be handy
>>
>> Erik
>>
>>
>>
>>
>> > On May 8, 2015, at 4:11 PM, Erik Hatcher 
>> wrote:
>> >
>> > What kinds of searches do you want to run?  Are you trying to extract
>> class names, method names, and such and make those searchable?   If that’s
>> the case, you need some kind of “parser” to reverse engineer that
>> information from .class and .jar files before feeding it to Solr, which
>> would happen before analysis.   Java itself comes with a javap command that
>> can do this; whether this is the “best” way to go for your scenario I don’t
>> know, but here’s an interesting example pasted below (using Solr 5.x).
>> >
>> > —
>> > Erik Hatcher, Senior Solutions Architect
>> > http://www.lucidworks.com
>> >
>> >
>> > javap
>> build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class >
>> test.txt
>> > bin/post -c java test.txt
>> >
>> > now search for "coreInfoMap"
>> http://localhost:8983/solr/java/browse?q=coreInfoMap
>> >
>> > I tried to be cleverer and use the stdin option of bin/post, like this:
>> > javap
>> build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class |
>> bin/post -c java -url http://localhost:8983/solr/java/update/extract
>> -type text/plain -params "literal.id=SolrLogFormatter" -out yes -d
>> > but something isn’t working right with the stdin detection like that
>> (it does work to `cat test.txt | bin/post…` though, hmmm)
>> >
>> > test.txt looks like this, `cat test.txt`:
>> > Compiled from "SolrLogFormatter.java"
>> > public class org.apache.solr.SolrLogFormatter extends
>> java.util.logging.Formatter {
>> >  long startTime;
>> >  long lastTime;
>> >  java.util.Map> java.lang.String> methodAlias;
>> >  public boolean shorterFormat;
>> >  java.util.Map> org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap;
>> >  public java.util.Map classAliases;
>> >  static java.lang.ThreadLocal threadLocal;
>> >  public org.apache.solr.SolrLogFormatter();
>> >  public void setShorterFormat();
>> >  public java.lang.String format(java.util.logging.LogRecord);
>> >  public void appendThread(java.lang.StringBuilder,
>> java.util.logging.LogRecord);
>> >  public java.lang.String _format(java.util.logging.LogRecord);
>> >  public java.lang.String getHead(java.util.logging.Handler);
>> >  public java.lang.String getTail(java.util.logging.Handler);
>> >  public java.lang.String formatMessage(java.util.logging.LogRecord);
>> >  public static void main(java.lang.String[]) throws java.lang.Exception;
>> >  public static void go() throws java.lang.Exception;
>> >  static {};
>> > }
>> >
>> >> On May 8, 2015, at 3:31 PM, Mark  wrote:
>> >>
>> >> I looking to use Solr search over the byte code in Classes and Jars.
>> >>
>> >> Does anyone know or have experience of Analyzers, Tokenizers, and Token
>> >> Filters for such a task?
>> >>
>> >> Regards
>> >>
>> >> Mark
>> >
>>
>>
>


Re: indexing java byte code in classes / jars

2015-05-08 Thread Mark
Erik,

Thanks for the pretty much OOTB approach.

I think I'm going to just try a range of approaches, and see how far I get.

The "IDE does this suggestion" would be worth looking into as well.




On 8 May 2015 at 22:14, Mark  wrote:

>
> https://searchcode.com/
>
> looks really interesting, however I want to crunch as much searchable
> aspects out of jars sititng on a classpath or under a project structure...
>
> Really early days so I'm open to any suggestions
>
>
>
> On 8 May 2015 at 22:09, Mark  wrote:
>
>> To answer why bytecode - because mostly the use case I have is looking to
>> index as much detail from jars/classes.
>>
>> extract class names,
>> method names
>> signatures
>> packages / imports
>>
>> I am considering using ASM in order to generate an analysis view of the
>> class
>>
>> The sort of usecases I have would be method / signature searches.
>>
>> For example;
>>
>> 1) show any classes with a method named parse*
>>
>> 2) show any classes with a method named parse that passes in a type *json*
>>
>> ...etc
>>
>> In the past I have written something to reverse out javadocs from just
>> java bytecode, using solr would move this idea considerably much more
>> powerful.
>>
>> Thanks for the suggestions so far
>>
>>
>>
>>
>>
>>
>>
>> On 8 May 2015 at 21:19, Erik Hatcher  wrote:
>>
>>> Oh, and sorry, I omitted a couple of details:
>>>
>>> # creating the “java” core/collection
>>> bin/solr create -c java
>>>
>>> # I ran this from my Solr source code checkout, so that
>>> SolrLogFormatter.class just happened to be handy
>>>
>>> Erik
>>>
>>>
>>>
>>>
>>> > On May 8, 2015, at 4:11 PM, Erik Hatcher 
>>> wrote:
>>> >
>>> > What kinds of searches do you want to run?  Are you trying to extract
>>> class names, method names, and such and make those searchable?   If that’s
>>> the case, you need some kind of “parser” to reverse engineer that
>>> information from .class and .jar files before feeding it to Solr, which
>>> would happen before analysis.   Java itself comes with a javap command that
>>> can do this; whether this is the “best” way to go for your scenario I don’t
>>> know, but here’s an interesting example pasted below (using Solr 5.x).
>>> >
>>> > —
>>> > Erik Hatcher, Senior Solutions Architect
>>> > http://www.lucidworks.com
>>> >
>>> >
>>> > javap
>>> build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class >
>>> test.txt
>>> > bin/post -c java test.txt
>>> >
>>> > now search for "coreInfoMap"
>>> http://localhost:8983/solr/java/browse?q=coreInfoMap
>>> >
>>> > I tried to be cleverer and use the stdin option of bin/post, like this:
>>> > javap
>>> build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class |
>>> bin/post -c java -url http://localhost:8983/solr/java/update/extract
>>> -type text/plain -params "literal.id=SolrLogFormatter" -out yes -d
>>> > but something isn’t working right with the stdin detection like that
>>> (it does work to `cat test.txt | bin/post…` though, hmmm)
>>> >
>>> > test.txt looks like this, `cat test.txt`:
>>> > Compiled from "SolrLogFormatter.java"
>>> > public class org.apache.solr.SolrLogFormatter extends
>>> java.util.logging.Formatter {
>>> >  long startTime;
>>> >  long lastTime;
>>> >  java.util.Map>> java.lang.String> methodAlias;
>>> >  public boolean shorterFormat;
>>> >  java.util.Map>> org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap;
>>> >  public java.util.Map classAliases;
>>> >  static java.lang.ThreadLocal threadLocal;
>>> >  public org.apache.solr.SolrLogFormatter();
>>> >  public void setShorterFormat();
>>> >  public java.lang.String format(java.util.logging.LogRecord);
>>> >  public void appendThread(java.lang.StringBuilder,
>>> java.util.logging.LogRecord);
>>> >  public java.lang.String _format(java.util.logging.LogRecord);
>>> >  public java.lang.String getHead(java.util.logging.Handler);
>>> >  public java.lang.String getTail(java.util.logging.Handler);
>>> >  public java.lang.String formatMessage(java.util.logging.LogRecord);
>>> >  public static void main(java.lang.String[]) throws
>>> java.lang.Exception;
>>> >  public static void go() throws java.lang.Exception;
>>> >  static {};
>>> > }
>>> >
>>> >> On May 8, 2015, at 3:31 PM, Mark  wrote:
>>> >>
>>> >> I looking to use Solr search over the byte code in Classes and Jars.
>>> >>
>>> >> Does anyone know or have experience of Analyzers, Tokenizers, and
>>> Token
>>> >> Filters for such a task?
>>> >>
>>> >> Regards
>>> >>
>>> >> Mark
>>> >
>>>
>>>
>>
>


Re: indexing java byte code in classes / jars

2015-05-09 Thread Mark
Hi Alexandre,

Solr & ASM is exactly the problem I'm looking to hack about with, so I'm keen
to consider any code, no matter how ugly or broken.

Regards

Mark

On 9 May 2015 at 10:21, Alexandre Rafalovitch  wrote:

> If you only have classes/jars, use ASM. I have done this before, have some
> ugly code to share if you want.
>
> If you have sources, javadoc 8 is a good way too. I am doing that now for
> solr-start.com, code on Github.
>
> Regards,
>     Alex
> On 9 May 2015 7:09 am, "Mark"  wrote:
>
> > To answer why bytecode - because mostly the use case I have is looking to
> > index as much detail from jars/classes.
> >
> > extract class names,
> > method names
> > signatures
> > packages / imports
> >
> > I am considering using ASM in order to generate an analysis view of the
> > class
> >
> > The sort of usecases I have would be method / signature searches.
> >
> > For example;
> >
> > 1) show any classes with a method named parse*
> >
> > 2) show any classes with a method named parse that passes in a type
> *json*
> >
> > ...etc
> >
> > In the past I have written something to reverse out javadocs from just
> java
> > bytecode, using solr would move this idea considerably much more
> powerful.
> >
> > Thanks for the suggestions so far
> >
> >
> >
> >
> >
> >
> >
> > On 8 May 2015 at 21:19, Erik Hatcher  wrote:
> >
> > > Oh, and sorry, I omitted a couple of details:
> > >
> > > # creating the “java” core/collection
> > > bin/solr create -c java
> > >
> > > # I ran this from my Solr source code checkout, so that
> > > SolrLogFormatter.class just happened to be handy
> > >
> > > Erik
> > >
> > >
> > >
> > >
> > > > On May 8, 2015, at 4:11 PM, Erik Hatcher 
> > wrote:
> > > >
> > > > What kinds of searches do you want to run?  Are you trying to extract
> > > class names, method names, and such and make those searchable?   If
> > that’s
> > > the case, you need some kind of “parser” to reverse engineer that
> > > information from .class and .jar files before feeding it to Solr, which
> > > would happen before analysis.   Java itself comes with a javap command
> > that
> > > can do this; whether this is the “best” way to go for your scenario I
> > don’t
> > > know, but here’s an interesting example pasted below (using Solr 5.x).
> > > >
> > > > —
> > > > Erik Hatcher, Senior Solutions Architect
> > > > http://www.lucidworks.com
> > > >
> > > >
> > > > javap
> > > build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class >
> > > test.txt
> > > > bin/post -c java test.txt
> > > >
> > > > now search for "coreInfoMap"
> > > http://localhost:8983/solr/java/browse?q=coreInfoMap
> > > >
> > > > I tried to be cleverer and use the stdin option of bin/post, like
> this:
> > > > javap
> > > build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class |
> > > bin/post -c java -url http://localhost:8983/solr/java/update/extract
> > > -type text/plain -params "literal.id=SolrLogFormatter" -out yes -d
> > > > but something isn’t working right with the stdin detection like that
> > (it
> > > does work to `cat test.txt | bin/post…` though, hmmm)
> > > >
> > > > test.txt looks like this, `cat test.txt`:
> > > > Compiled from "SolrLogFormatter.java"
> > > > public class org.apache.solr.SolrLogFormatter extends
> > > java.util.logging.Formatter {
> > > >  long startTime;
> > > >  long lastTime;
> > > >  java.util.Map > > java.lang.String> methodAlias;
> > > >  public boolean shorterFormat;
> > > >  java.util.Map > > org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap;
> > > >  public java.util.Map
> classAliases;
> > > >  static java.lang.ThreadLocal threadLocal;
> > > >  public org.apache.solr.SolrLogFormatter();
> > > >  public void setShorterFormat();
> > > >  public java.lang.String format(java.util.logging.LogRecord);
> > > >  public void appendThread(java.lang.StringBuilder,
> > > java.util.logging.LogRecord);
> > > >  public java.lang.String _format(java.util.logging.LogRecord);
> > > >  public java.lang.String getHead(java.util.logging.Handler);
> > > >  public java.lang.String getTail(java.util.logging.Handler);
> > > >  public java.lang.String formatMessage(java.util.logging.LogRecord);
> > > >  public static void main(java.lang.String[]) throws
> > java.lang.Exception;
> > > >  public static void go() throws java.lang.Exception;
> > > >  static {};
> > > > }
> > > >
> > > >> On May 8, 2015, at 3:31 PM, Mark  wrote:
> > > >>
> > > >> I looking to use Solr search over the byte code in Classes and Jars.
> > > >>
> > > >> Does anyone know or have experience of Analyzers, Tokenizers, and
> > Token
> > > >> Filters for such a task?
> > > >>
> > > >> Regards
> > > >>
> > > >> Mark
> > > >
> > >
> > >
> >
>


Configuring number or shards

2013-11-05 Thread Mark
Can you configure the number of shards per collection, or is this a
system-wide setting affecting all collections/indexes?

Thanks

Sharding and replicas (Solr Cloud)

2013-11-07 Thread Mark
If I create my collection via the ZkCLI 
(https://cwiki.apache.org/confluence/display/solr/Command+Line+Utilities) how 
do I configure the number of shards and replicas?

Thanks

SimplePostTool with extracted Outlook messages

2015-01-26 Thread Mark
I'm looking to index some extracted Outlook messages (*.msg).

I notice msg isn't one of the default file types, so I tried the following:

java -classpath dist/solr-core-4.10.3.jar -Dtype=application/vnd.ms-outlook
org.apache.solr.util.SimplePostTool C:/temp/samplemsg/*.msg

That didn't work

However curl did:

curl "
http://localhost:8983/solr/update/extract?commit=true&overwrite=true&literal.id=6252671B765A1748992DF1A6403BDF81A4A15E00";
-F "myfile=@6252671B765A1748992DF1A6403BDF81A4A15E00.msg"

My question is why does the second work and not the first?


Re: SimplePostTool with extracted Outlook messages

2015-01-26 Thread Mark
A little further

This fails

 java -classpath dist/solr-core-4.10.3.jar
-Dtype=application/vnd.ms-outlook org.apache.solr.util.SimplePostTool
C:/temp/samplemsg/*.msg

With:

SimplePostTool: WARNING: IOException while reading response:
java.io.IOException: Server returned HTTP response code: 415 for URL:
http://localhost:8983/solr/update
POSTing file 6252671B765A1748992DF1A6403BDF81A4A22C00.msg
SimplePostTool: WARNING: Solr returned an error #415 (Unsupported Media
Type) for url: http://localhost:8983/solr/update
SimplePostTool: WARNING: Response:
<response>
  <lst name="responseHeader"><int name="status">415</int><int name="QTime">0</int></lst>
  <lst name="error">
    <str name="msg">Unsupported ContentType: application/vnd.ms-outlook  Not in: [application/xml, text/csv, text/json, application/csv, application/javabin, text/xml, application/json]</str>
    <int name="code">415</int>
  </lst>
</response>


However, just calling the extract handler works:

curl "http://localhost:8983/solr/update/extract?extractOnly=true"; -F
"myfile=@6252671B765A1748992DF1A6403BDF81A4A22C00.msg"

Regards

Mark

On 26 January 2015 at 21:47, Alexandre Rafalovitch 
wrote:

> Seems like apple to oranges comparison here.
>
> I would try giving an explicit end point (.../extract), a single
> message, and a literal id for the SimplePostTool and seeing whether
> that works. Not providing an ID could definitely be an issue.
>
> I would also specifically look on the server side in the logs and see
> what the messages say to understand the discrepancies. Solr 5 is a bit
> more verbose about what's going under the covers, but that's not
> available yet.
>
> Regards,
>Alex.
> 
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
>
> On 26 January 2015 at 16:34, Mark  wrote:
> > I'm looking to index some outlook extracted messages *.msg
> >
> > I notice by default msg isn't one of the defaults so I tried the
> following:
> >
> > java -classpath dist/solr-core-4.10.3.jar
> -Dtype=application/vnd.ms-outlook
> > org.apache.solr.util.SimplePostTool C:/temp/samplemsg/*.msg
> >
> > That didn't work
> >
> > However curl did:
> >
> > curl "
> >
> http://localhost:8983/solr/update/extract?commit=true&overwrite=true&literal.id=6252671B765A1748992DF1A6403BDF81A4A15E00
> "
> > -F "myfile=@6252671B765A1748992DF1A6403BDF81A4A15E00.msg"
> >
> > My question is why does the second work and not the first?
>


Re: SimplePostTool with extracted Outlook messages

2015-01-26 Thread Mark
Fantastic - that explains it.

Adding -Durl="http://localhost:8983/solr/update/extract?commit=true&overwrite=true"

gets me a little further:

POSTing file 6252671B765A1748992DF1A6403BDF81A4A22E00.msg
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url:
http://localhost:8983/solr/update/extract?commit=true&overwrite=true
SimplePostTool: WARNING: Response:
<response>
  <lst name="responseHeader"><int name="status">400</int><int name="QTime">44</int></lst>
  <lst name="error">
    <str name="msg">Document is missing mandatory uniqueKey field: id</str>
    <int name="code">400</int>
  </lst>
</response>

However it's not much use when recursing a directory, as the URL essentially
has to change to pass each document's ID.

I think I may just extend SimplePostTool, or perhaps look at using Solr Cell?
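
For example, a small SolrJ client against the extract handler could set
literal.id per file - a rough, untested sketch against the Solr 4.x SolrJ API,
where the id-from-filename convention and the class name are just my
assumptions:

import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

// Posts every .msg file in a folder to /update/extract, deriving the
// uniqueKey from the file name so each document gets its own id.
public class MsgPoster {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        File[] files = new File(args[0]).listFiles((dir, name) -> name.endsWith(".msg"));
        for (File msg : files) {
            ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
            req.addFile(msg, "application/vnd.ms-outlook");
            req.setParam("literal.id", msg.getName().replace(".msg", ""));
            solr.request(req);
        }
        solr.commit();
    }
}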



On 26 January 2015 at 22:14, Alexandre Rafalovitch 
wrote:

> Well, you are NOT posting to the same URL.
>
>
> On 26 January 2015 at 17:00, Mark  wrote:
> > http://localhost:8983/solr/update
>
>
>
> 
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>


Re: SimplePostTool with extracted Outlook messages

2015-01-27 Thread Mark
Thanks Erik

However

java -classpath dist/solr-core-4.10.3.jar -Dauto=true
org.apache.solr.util.SimplePostTool C:/temp/samplemsg/*.msg

Fails with:

Posting files to base url http://localhost:8983/solr/update..
Entering auto mode. File endings considered are
xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
SimplePostTool: WARNING: Skipping 6252671B765A1748992DF1A6403BDF81A4A02A00.msg. Unsupported file type for auto mode.
SimplePostTool: WARNING: Skipping 6252671B765A1748992DF1A6403BDF81A4A02B00.msg. Unsupported file type for auto mode.
SimplePostTool: WARNING: Skipping 6252671B765A1748992DF1A6403BDF81A4A02C00.msg. Unsupported file type for auto mode.

That's where I started looking into extending or adding support for
additional types.

Looking into the code as it stands, passing your own URL as well as asking it
to recurse a folder means that it requires an ID strategy - which I believe is
lacking.

Regards

Mark



On 27 January 2015 at 10:57, Erik Hatcher  wrote:

> Try adding -Dauto=true and take away setting url.  The type probably isn't
> needed then either.
>
> With the new Solr 5 bin/post it sets auto=true implicitly.
>
> Erik
>
>
> > On Jan 26, 2015, at 17:29, Mark  wrote:
> >
> > Fantastic - that explians it
> >
> > Adding -Durl="
> > http://localhost:8983/solr/update/extract?commit=true&overwrite=true";
> >
> > Get's me a little further
> >
> > POSTing file 6252671B765A1748992DF1A6403BDF81A4A22E00.msg
> > SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for
> url:
> > http://localhost:8983/solr/update/extract?commit=true&overwrite=true
> > SimplePostTool: WARNING: Response: 
> > 
> > 400 > name="QTime">44Document is
> > missing mandatory uniqueKey field: id400
> > 
> >
> > However not much use when recursing a directory and the URL essentially
> has
> > to change to pass the document ID
> >
> > I think I may just extend SimplePostToo or look to use Solr Cell perhaps?
> >
> >
> >
> > On 26 January 2015 at 22:14, Alexandre Rafalovitch 
> > wrote:
> >
> >> Well, you are NOT posting to the same URL.
> >>
> >>
> >>> On 26 January 2015 at 17:00, Mark  wrote:
> >>> http://localhost:8983/solr/update
> >>
> >>
> >>
> >> 
> >> Sign up for my Solr resources newsletter at http://www.solr-start.com/
> >>
>


Re: SimplePostTool with extracted Outlook messages

2015-01-27 Thread Mark
Hi Alex,

On an individual file basis that would work, since you could set the ID per
file.

However when recursing a folder it doesn't work, and worse still the server
complains, unless on the server side you use an UpdateRequestProcessor chain
with the UUID generator as you suggested.

Thanks for everyone's suggestions.

Regards

Mark

On 27 January 2015 at 18:01, Alexandre Rafalovitch 
wrote:

> Your IDs seem to be the file names, which you are probably also getting
> from your parsing the file. Can't you just set (or copyField) that as an ID
> on the Solr side?
>
> Alternatively, if you don't actually have good IDs, you could look into
> UpdateRequestProcessor chains with  UUID generator.
>
> Regards,
>
>Alex.
> On 27/01/2015 12:24 pm, "Mark"  wrote:
>
> > Thanks Eric
> >
> > However
> >
> > java -classpath dist/solr-core-4.10.3.jar -Dauto=true
> > org.apache.solr.util.SimplePostTool C:/temp/samplemsg/*.msg
> >
> > Fails with:
> >
> > osting files to base url http://localhost:8983/solr/update..
> > ntering auto mode. File endings considered are
> >
> >
> xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> > implePostTool: WARNING: Skipping
> > 6252671B765A1748992DF1A6403BDF81A4A02A00.msg. Unsupported file
> type
> > for auto mode.
> > implePostTool: WARNING: Skipping
> > 6252671B765A1748992DF1A6403BDF81A4A02B00.msg. Unsupported file
> type
> > for auto mode.
> > implePostTool: WARNING: Skipping
> > 6252671B765A1748992DF1A6403BDF81A4A02C00.msg. Unsupported file
> type
> > for auto mode.
> >
> > That's where I started looking into extending or adding support for
> > additional types.
> >
> > Looking into the code as it stands passing you own URL as well as asking
> it
> > to recurse a folder means that is requires an ID strategy - which I
> believe
> > is lacking.
> >
> > Reagrds
> >
> > Mark
> >
> >
>


Re: SimplePostTool with extracted Outlook messages

2015-01-27 Thread Mark
In the end I didn't find a way to add a new file/MIME type when recursing a
folder.

So I added msg to the static string and the MIME map:

private static final String DEFAULT_FILE_TYPES =
"xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log,msg";

mimeMap.put("msg", "application/vnd.ms-outlook");

Regards

Mark


On 27 January 2015 at 18:39, Mark  wrote:

> Hi Alex,
>
> On an individual file basis that would work, since you could set the ID on
> an individual basis.
>
> However recuring a folder it doesn't work, and worse still the server
> complains, unless on the server side you can use the UpdateRequestProcessor
> chains with  UUID generator as you suggested.
>
> Thanks for eveyones suggestions.
>
> Regards
>
> Mark
>
> On 27 January 2015 at 18:01, Alexandre Rafalovitch 
> wrote:
>
>> Your IDs seem to be the file names, which you are probably also getting
>> from your parsing the file. Can't you just set (or copyField) that as an
>> ID
>> on the Solr side?
>>
>> Alternatively, if you don't actually have good IDs, you could look into
>> UpdateRequestProcessor chains with  UUID generator.
>>
>> Regards,
>>
>>Alex.
>> On 27/01/2015 12:24 pm, "Mark"  wrote:
>>
>> > Thanks Eric
>> >
>> > However
>> >
>> > java -classpath dist/solr-core-4.10.3.jar -Dauto=true
>> > org.apache.solr.util.SimplePostTool C:/temp/samplemsg/*.msg
>> >
>> > Fails with:
>> >
>> > osting files to base url http://localhost:8983/solr/update..
>> > ntering auto mode. File endings considered are
>> >
>> >
>> xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
>> > implePostTool: WARNING: Skipping
>> > 6252671B765A1748992DF1A6403BDF81A4A02A00.msg. Unsupported file
>> type
>> > for auto mode.
>> > implePostTool: WARNING: Skipping
>> > 6252671B765A1748992DF1A6403BDF81A4A02B00.msg. Unsupported file
>> type
>> > for auto mode.
>> > implePostTool: WARNING: Skipping
>> > 6252671B765A1748992DF1A6403BDF81A4A02C00.msg. Unsupported file
>> type
>> > for auto mode.
>> >
>> > That's where I started looking into extending or adding support for
>> > additional types.
>> >
>> > Looking into the code as it stands passing you own URL as well as
>> asking it
>> > to recurse a folder means that is requires an ID strategy - which I
>> believe
>> > is lacking.
>> >
>> > Reagrds
>> >
>> > Mark
>> >
>> >
>>
>
>


extract and add fields on the fly

2015-01-28 Thread Mark
Is it possible to use curl to upload a document (for extract & indexing)
and specify some fields on the fly?

sort of:
1) index this document
2) by the way here are some important facets whilst you're at it

Regards

Mark


Re: extract and add fields on the fly

2015-01-28 Thread Mark
"Create the SID from the existing doc" implies that a document already
exists that you wish to add fields to.

However, if the document is a binary, are you suggesting:

1) curl to upload/extract passing docID
2) obtain a SID based off docID
3) add additional fields to SID & commit

I know I'm possibly wandering into schemaless territory here as well
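
For reference, the SolrInputDocument route would look roughly like this - a
sketch only, which assumes the text has already been extracted from the binary
on the client side (e.g. via Tika), and where the field names other than id
are made up:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Build the SolrInputDocument (SID) yourself, add whatever extra fields you
// like, then push the whole thing to Solr in one go.
public class SidExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "6252671B765A1748992DF1A6403BDF81A4A15E00");
        doc.addField("content", "...text extracted from the binary...");
        doc.addField("source", "old system"); // the additional field/facet

        solr.add(doc);
        solr.commit();
    }
}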


On 28 January 2015 at 17:11, Andrew Pawloski  wrote:

> I would switch the order of those. Add the new fields and *then* index to
> solr.
>
> We do something similar when we create SolrInputDocuments that are pushed
> to solr. Create the SID from the existing doc, add any additional fields,
> then add to solr.
>
> On Wed, Jan 28, 2015 at 11:56 AM, Mark  wrote:
>
> > Is it possible to use curl to upload a document (for extract & indexing)
> > and specify some fields on the fly?
> >
> > sort of:
> > 1) index this document
> > 2) by the way here are some important facets whilst your at it
> >
> > Regards
> >
> > Mark
> >
>


Re: extract and add fields on the fly

2015-01-28 Thread Mark
On second thoughts, the SID is purely input (i/p), as its name suggests :)

I think a better approach would be:

1) curl to upload/extract passing docID
2) curl to update additional fields for that docID



On 28 January 2015 at 17:30, Mark  wrote:

>
> "Create the SID from the existing doc" implies that a document already
> exists that you wish to add fields to.
>
> However if the document is a binary are you suggesting
>
> 1) curl to upload/extract passing docID
> 2) obtain a SID based off docID
> 3) add addtinal fields to SID & commit
>
> I know I'm possibly wandering into the schemaless teritory here as well
>
>
> On 28 January 2015 at 17:11, Andrew Pawloski  wrote:
>
>> I would switch the order of those. Add the new fields and *then* index to
>> solr.
>>
>> We do something similar when we create SolrInputDocuments that are pushed
>> to solr. Create the SID from the existing doc, add any additional fields,
>> then add to solr.
>>
>> On Wed, Jan 28, 2015 at 11:56 AM, Mark  wrote:
>>
>> > Is it possible to use curl to upload a document (for extract & indexing)
>> > and specify some fields on the fly?
>> >
>> > sort of:
>> > 1) index this document
>> > 2) by the way here are some important facets whilst your at it
>> >
>> > Regards
>> >
>> > Mark
>> >
>>
>
>


Re: extract and add fields on the fly

2015-01-28 Thread Mark
I'm looking to

1) upload a binary document using curl
2) add some additional facets

Specifically, my question is: can this be achieved in one curl operation, or
does it need two?

On 28 January 2015 at 17:43, Mark  wrote:

>
> Second thoughts SID is purely i/p as its name suggests :)
>
> I think a better approach would be
>
> 1) curl to upload/extract passing docID
> 2) curl to update additional fields for that docID
>
>
>
> On 28 January 2015 at 17:30, Mark  wrote:
>
>>
>> "Create the SID from the existing doc" implies that a document already
>> exists that you wish to add fields to.
>>
>> However if the document is a binary are you suggesting
>>
>> 1) curl to upload/extract passing docID
>> 2) obtain a SID based off docID
>> 3) add addtinal fields to SID & commit
>>
>> I know I'm possibly wandering into the schemaless teritory here as well
>>
>>
>> On 28 January 2015 at 17:11, Andrew Pawloski  wrote:
>>
>>> I would switch the order of those. Add the new fields and *then* index to
>>> solr.
>>>
>>> We do something similar when we create SolrInputDocuments that are pushed
>>> to solr. Create the SID from the existing doc, add any additional fields,
>>> then add to solr.
>>>
>>> On Wed, Jan 28, 2015 at 11:56 AM, Mark  wrote:
>>>
>>> > Is it possible to use curl to upload a document (for extract &
>>> indexing)
>>> > and specify some fields on the fly?
>>> >
>>> > sort of:
>>> > 1) index this document
>>> > 2) by the way here are some important facets whilst your at it
>>> >
>>> > Regards
>>> >
>>> > Mark
>>> >
>>>
>>
>>
>


Re: extract and add fields on the fly

2015-01-28 Thread Mark
The use case is:

use curl to upload/extract/index a document, passing in additional facets not
present in the document itself, e.g. literal.source="old system"

In this way some fields come from the content extracted from the upload and
some fields are specified in the curl URL.

Hope that's clearer?

Regards

Mark


On 28 January 2015 at 17:54, Alexandre Rafalovitch 
wrote:

> Sounds like 'literal.X' syntax from
>
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>
> Can you explain your use case as different from what's already
> documented? May be easier to understand.
>
> Regards,
>Alex.
> 
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
>
> On 28 January 2015 at 12:45, Mark  wrote:
> > I'm looking to
> >
> > 1) upload a binary document using curl
> > 2) add some additional facets
> >
> > Specifically my question is can this be achieved in 1 curl operation or
> > does it need 2?
> >
> > On 28 January 2015 at 17:43, Mark  wrote:
> >
> >>
> >> Second thoughts SID is purely i/p as its name suggests :)
> >>
> >> I think a better approach would be
> >>
> >> 1) curl to upload/extract passing docID
> >> 2) curl to update additional fields for that docID
> >>
> >>
> >>
> >> On 28 January 2015 at 17:30, Mark  wrote:
> >>
> >>>
> >>> "Create the SID from the existing doc" implies that a document already
> >>> exists that you wish to add fields to.
> >>>
> >>> However if the document is a binary are you suggesting
> >>>
> >>> 1) curl to upload/extract passing docID
> >>> 2) obtain a SID based off docID
> >>> 3) add addtinal fields to SID & commit
> >>>
> >>> I know I'm possibly wandering into the schemaless teritory here as well
> >>>
> >>>
> >>> On 28 January 2015 at 17:11, Andrew Pawloski 
> wrote:
> >>>
> >>>> I would switch the order of those. Add the new fields and *then*
> index to
> >>>> solr.
> >>>>
> >>>> We do something similar when we create SolrInputDocuments that are
> pushed
> >>>> to solr. Create the SID from the existing doc, add any additional
> fields,
> >>>> then add to solr.
> >>>>
> >>>> On Wed, Jan 28, 2015 at 11:56 AM, Mark  wrote:
> >>>>
> >>>> > Is it possible to use curl to upload a document (for extract &
> >>>> indexing)
> >>>> > and specify some fields on the fly?
> >>>> >
> >>>> > sort of:
> >>>> > 1) index this document
> >>>> > 2) by the way here are some important facets whilst your at it
> >>>> >
> >>>> > Regards
> >>>> >
> >>>> > Mark
> >>>> >
> >>>>
> >>>
> >>>
> >>
>


Re: extract and add fields on the fly

2015-01-28 Thread Mark
That approach works, although as suspected the schema has to recognise the
additional facet ("stuff" in this case):

"responseHeader":{"status":400,"QTime":1},"error":{"msg":"ERROR:
[doc=6252671B765A1748992DF1A6403BDF81A4A15E00] unknown field
'stuff'","code":400}}

..getting closer..

On 28 January 2015 at 18:03, Mark  wrote:

>
> Use case is
>
> use curl to upload/extract/index document passing in additional facets not
> present in the document e.g. literal.source="old system"
>
> In this way some fields come from the uploaded extracted content and some
> fields as specified in the curl URL
>
> Hope that's clearer?
>
> Regards
>
> Mark
>
>
> On 28 January 2015 at 17:54, Alexandre Rafalovitch 
> wrote:
>
>> Sounds like 'literal.X' syntax from
>>
>> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>>
>> Can you explain your use case as different from what's already
>> documented? May be easier to understand.
>>
>> Regards,
>>Alex.
>> 
>> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>>
>>
>> On 28 January 2015 at 12:45, Mark  wrote:
>> > I'm looking to
>> >
>> > 1) upload a binary document using curl
>> > 2) add some additional facets
>> >
>> > Specifically my question is can this be achieved in 1 curl operation or
>> > does it need 2?
>> >
>> > On 28 January 2015 at 17:43, Mark  wrote:
>> >
>> >>
>> >> Second thoughts SID is purely i/p as its name suggests :)
>> >>
>> >> I think a better approach would be
>> >>
>> >> 1) curl to upload/extract passing docID
>> >> 2) curl to update additional fields for that docID
>> >>
>> >>
>> >>
>> >> On 28 January 2015 at 17:30, Mark  wrote:
>> >>
>> >>>
>> >>> "Create the SID from the existing doc" implies that a document already
>> >>> exists that you wish to add fields to.
>> >>>
>> >>> However if the document is a binary are you suggesting
>> >>>
>> >>> 1) curl to upload/extract passing docID
>> >>> 2) obtain a SID based off docID
>> >>> 3) add addtinal fields to SID & commit
>> >>>
>> >>> I know I'm possibly wandering into the schemaless teritory here as
>> well
>> >>>
>> >>>
>> >>> On 28 January 2015 at 17:11, Andrew Pawloski 
>> wrote:
>> >>>
>> >>>> I would switch the order of those. Add the new fields and *then*
>> index to
>> >>>> solr.
>> >>>>
>> >>>> We do something similar when we create SolrInputDocuments that are
>> pushed
>> >>>> to solr. Create the SID from the existing doc, add any additional
>> fields,
>> >>>> then add to solr.
>> >>>>
>> >>>> On Wed, Jan 28, 2015 at 11:56 AM, Mark  wrote:
>> >>>>
>> >>>> > Is it possible to use curl to upload a document (for extract &
>> >>>> indexing)
>> >>>> > and specify some fields on the fly?
>> >>>> >
>> >>>> > sort of:
>> >>>> > 1) index this document
>> >>>> > 2) by the way here are some important facets whilst your at it
>> >>>> >
>> >>>> > Regards
>> >>>> >
>> >>>> > Mark
>> >>>> >
>> >>>>
>> >>>
>> >>>
>> >>
>>
>
>


Re: extract and add fields on the fly

2015-01-28 Thread Mark
Thanks Alexandre,

I figured it out with this example,

https://wiki.apache.org/solr/ExtractingRequestHandler

whereby you can add additional fields at upload/extract time:

curl "
http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3&literal.blah_s=Bah";
-F "tutorial=@"help.pdf

and therefore I learned that you can't update a field that isn't in the
original document, which is what I was trying to do before.

Regards

Mark



On 28 January 2015 at 18:38, Alexandre Rafalovitch 
wrote:

> Well, the schema does need to know what type your field is. If you
> can't add it to schema, use dynamicFields with prefixe/suffixes or
> dynamic schema (less recommended).
>
> Regards,
>Alex.
> 
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
>
> On 28 January 2015 at 13:32, Mark  wrote:
> > That approach works although as suspected the schma has to recognise the
> > additinal facet (stuff in this case):
> >
> > "responseHeader":{"status":400,"QTime":1},"error":{"msg":"ERROR:
> > [doc=6252671B765A1748992DF1A6403BDF81A4A15E00] unknown field
> > 'stuff'","code":400}}
> >
> > ..getting closer..
> >
> > On 28 January 2015 at 18:03, Mark  wrote:
> >
> >>
> >> Use case is
> >>
> >> use curl to upload/extract/index document passing in additional facets
> not
> >> present in the document e.g. literal.source="old system"
> >>
> >> In this way some fields come from the uploaded extracted content and
> some
> >> fields as specified in the curl URL
> >>
> >> Hope that's clearer?
> >>
> >> Regards
> >>
> >> Mark
> >>
> >>
> >> On 28 January 2015 at 17:54, Alexandre Rafalovitch 
> >> wrote:
> >>
> >>> Sounds like 'literal.X' syntax from
> >>>
> >>>
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> >>>
> >>> Can you explain your use case as different from what's already
> >>> documented? May be easier to understand.
> >>>
> >>> Regards,
> >>>Alex.
> >>> 
> >>> Sign up for my Solr resources newsletter at http://www.solr-start.com/
> >>>
> >>>
> >>> On 28 January 2015 at 12:45, Mark  wrote:
> >>> > I'm looking to
> >>> >
> >>> > 1) upload a binary document using curl
> >>> > 2) add some additional facets
> >>> >
> >>> > Specifically my question is can this be achieved in 1 curl operation
> or
> >>> > does it need 2?
> >>> >
> >>> > On 28 January 2015 at 17:43, Mark  wrote:
> >>> >
> >>> >>
> >>> >> Second thoughts SID is purely i/p as its name suggests :)
> >>> >>
> >>> >> I think a better approach would be
> >>> >>
> >>> >> 1) curl to upload/extract passing docID
> >>> >> 2) curl to update additional fields for that docID
> >>> >>
> >>> >>
> >>> >>
> >>> >> On 28 January 2015 at 17:30, Mark  wrote:
> >>> >>
> >>> >>>
> >>> >>> "Create the SID from the existing doc" implies that a document
> already
> >>> >>> exists that you wish to add fields to.
> >>> >>>
> >>> >>> However if the document is a binary are you suggesting
> >>> >>>
> >>> >>> 1) curl to upload/extract passing docID
> >>> >>> 2) obtain a SID based off docID
> >>> >>> 3) add addtinal fields to SID & commit
> >>> >>>
> >>> >>> I know I'm possibly wandering into the schemaless teritory here as
> >>> well
> >>> >>>
> >>> >>>
> >>> >>> On 28 January 2015 at 17:11, Andrew Pawloski 
> >>> wrote:
> >>> >>>
> >>> >>>> I would switch the order of those. Add the new fields and *then*
> >>> index to
> >>> >>>> solr.
> >>> >>>>
> >>> >>>> We do something similar when we create SolrInputDocuments that are
> >>> pushed
> >>> >>>> to solr. Create the SID from the existing doc, add any additional
> >>> fields,
> >>> >>>> then add to solr.
> >>> >>>>
> >>> >>>> On Wed, Jan 28, 2015 at 11:56 AM, Mark 
> wrote:
> >>> >>>>
> >>> >>>> > Is it possible to use curl to upload a document (for extract &
> >>> >>>> indexing)
> >>> >>>> > and specify some fields on the fly?
> >>> >>>> >
> >>> >>>> > sort of:
> >>> >>>> > 1) index this document
> >>> >>>> > 2) by the way here are some important facets whilst your at it
> >>> >>>> >
> >>> >>>> > Regards
> >>> >>>> >
> >>> >>>> > Mark
> >>> >>>> >
> >>> >>>>
> >>> >>>
> >>> >>>
> >>> >>
> >>>
> >>
> >>
>


Duplicate documents based on attribute

2013-07-25 Thread Mark
How would I go about doing something like this? Not sure if this is something
that can be accomplished on the index side or if it's something that should be
done in our application.

Say we are an online store for shoes and we are selling Product A in red, blue
and green. Is there a way, when we search for Product A, that all three results
can be returned even though they are logically the same item (the same product
in our database)?

Thoughts on how this can be accomplished?

Thanks

- M

Re: Duplicate documents based on attribute

2013-07-25 Thread Mark
I was hoping to do this from within Solr, that way I don't have to manually
mess around with pagination. The number of items on each page would be
indeterminate.
On Jul 25, 2013, at 9:48 AM, Anshum Gupta  wrote:

> Have a multivalued stored 'color' field and just iterate on it outside of
> solr.
> 
> 
> On Thu, Jul 25, 2013 at 10:12 PM, Mark  wrote:
> 
>> How would I go about doing something like this. Not sure if this is
>> something that can be accomplished on the index side or its something that
>> should be done in our application.
>> 
>> Say we are an online store for shoes and we are selling Product A in red,
>> blue and green. Is there a way when we search for Product A all three
>> results can be returned even though they are logically the same item (same
>> product in our database).
>> 
>> Thoughts on how this can be accomplished?
>> 
>> Thanks
>> 
>> - M
> 
> 
> 
> 
> -- 
> 
> Anshum Gupta
> http://www.anshumgupta.net



Alternative searches

2013-07-31 Thread Mark
Can someone explain how one would go about providing alternative searches for a 
query… similar to Amazon.

For example say I search for "Red Dump Truck"

- 0 results for "Red Dump Truck"
- 500 results for " Red Truck"
- 350 results for "Dump Truck"

Does this require multiple searches? 

Thanks

Percolate feature?

2013-08-02 Thread Mark
We have a set number of known terms we want to match against.

In Index:
"term one"
"term two"
"term three"

I know how to match all terms of a user query against the index but we would 
like to know how/if we can match a user's query against all the terms in the 
index?

Search Queries:
"my search term" => 0 matches
"my term search one" => 1 match  ("term one")
"some prefix term two" => 1 match ("term two")
"one two three" => 0 matches

I can only explain this as almost a reverse search???

I came across the following from ElasticSearch 
(http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds 
like this may accomplish the above but haven't tested. I was wondering if Solr 
had something similar or an alternative way of accomplishing this?

Thanks



Re: Problems matching delimited field

2013-08-05 Thread Mark
That was it… thanks

On Aug 2, 2013, at 3:27 PM, Shawn Heisey  wrote:

> On 8/2/2013 4:16 PM, Robert Zotter wrote:
>> The problem is the query get's expanded to "1 Foo" not ( "1" OR "Foo")
>> 
>> 1Foo
>> 1Foo
>> +DisjunctionMaxQuery((name_textsv:"1 foo")) ()
>> +(name_textsv:"1 foo") ()
>> 
>> DisMaxQParser
>> 
>> 
> 
> This looks like you have autoGeneratePhraseQueries turned on for the field 
> definition in your schema, either explicitly or by having a "version" 
> attribute in schema.xml of 1.3 or lower.  The current schema version is 1.5.
> 
> Thanks,
> Shawn
> 



Re: Percolate feature?

2013-08-05 Thread Mark
> "can match a user's query against all the terms in the index" - that's 
> exactly what Lucene and Solr have done since Day One, for all queries. 
> Percolate actually does the opposite - matches an input document against a 
> registered set of queries - and doesn't match against indexed documents.
> 
> Solr does support Lucene's "min should match" feature so that you can 
> specify, say, four query terms  and return if at least two match. This is the 
> "mm" parameter.


I don't think you understand me.

Say I only have one document indexed and its contents are "Foo Bar". I want
this document returned if and only if the query has the words "Foo" and "Bar"
in it. If I use an mm of 100% for "Foo Bar Bazz" this document will not be
returned because the full user query didn't match. If I use a 0% mm and search
"Foo Baz" the document will be returned even though it shouldn't be.

On Aug 2, 2013, at 5:09 PM, Jack Krupansky  wrote:

> You seem to be mixing a couple of different concepts here. "Prospective 
> search" or reverse search, (sometimes called alerts) is a logistics matter, 
> but how to match terms is completely different.
> 
> Solr does not have the exact "percolate" feature of ES, but your examples 
> don't indicate a need for what percolate actually does.
> 
> "can match a user's query against all the terms in the index" - that's 
> exactly what Lucene and Solr have done since Day One, for all queries. 
> Percolate actually does the opposite - matches an input document against a 
> registered set of queries - and doesn't match against indexed documents.
> 
> Solr does support Lucene's "min should match" feature so that you can 
> specify, say, four query terms  and return if at least two match. This is the 
> "mm" parameter.
> 
> See:
> http://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29
> 
> Try to clarify your requirements... or maybe min-should-match was all you 
> needed?
> 
> -- Jack Krupansky
> 
> -Original Message- From: Mark
> Sent: Friday, August 02, 2013 7:50 PM
> To: solr-user@lucene.apache.org
> Subject: Percolate feature?
> 
> We have a set number of known terms we want to match against.
> 
> In Index:
> "term one"
> "term two"
> "term three"
> 
> I know how to match all terms of a user query against the index but we would 
> like to know how/if we can match a user's query against all the terms in the 
> index?
> 
> Search Queries:
> "my search term" => 0 matches
> "my term search one" => 1 match  ("term one")
> "some prefix term two" => 1 match ("term two")
> "one two three" => 0 matches
> 
> I can only explain this is almost a reverse search???
> 
> I came across the following from ElasticSearch 
> (http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds 
> like this may accomplish the above but haven't tested. I was wondering if 
> Solr had something similar or an alternative way of accomplishing this?
> 
> Thanks
> 



Re: Percolate feature?

2013-08-05 Thread Mark
Still not understanding. How do I know which words to require while searching? 
I want to search across all documents and return ones that have all of their 
terms matched.


>> I came across the following from ElasticSearch 
>> (http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds 
>> like this may accomplish the above but haven't tested. I was wondering if 
>> Solr had something similar or an alternative way of accomplishing this?

Also, I never said this was Percolate, it just looked similar.

On Aug 5, 2013, at 11:43 AM, "Jack Krupansky"  wrote:

> Fine, then write the query that way:  +foo +bar baz
> 
> But it still doesn't sound as if any of this relates to prospective 
> search/percolate.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Mark
> Sent: Monday, August 05, 2013 2:11 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Percolate feature?
> 
>> "can match a user's query against all the terms in the index" - that's 
>> exactly what Lucene and Solr have done since Day One, for all queries. 
>> Percolate actually does the opposite - matches an input document against a 
>> registered set of queries - and doesn't match against indexed documents.
>> 
>> Solr does support Lucene's "min should match" feature so that you can 
>> specify, say, four query terms  and return if at least two match. This is 
>> the "mm" parameter.
> 
> 
> I don't think you understand me.
> 
> Say I only have one document indexed and it's contents are "Foo Bar". I want 
> this documented returned if and only if the query has the words "Foo" and 
> "Bar" in it. If I use a mm of 100% for "Foo Bar Bazz" this document will not 
> be returned because the full user query didn't match. I i use a 0% mm and 
> search "Foo Baz" the documented will be returned even though it shouldn't.
> 
> On Aug 2, 2013, at 5:09 PM, Jack Krupansky  wrote:
> 
>> You seem to be mixing a couple of different concepts here. "Prospective 
>> search" or reverse search, (sometimes called alerts) is a logistics matter, 
>> but how to match terms is completely different.
>> 
>> Solr does not have the exact "percolate" feature of ES, but your examples 
>> don't indicate a need for what percolate actually does.
>> 
>> "can match a user's query against all the terms in the index" - that's 
>> exactly what Lucene and Solr have done since Day One, for all queries. 
>> Percolate actually does the opposite - matches an input document against a 
>> registered set of queries - and doesn't match against indexed documents.
>> 
>> Solr does support Lucene's "min should match" feature so that you can 
>> specify, say, four query terms  and return if at least two match. This is 
>> the "mm" parameter.
>> 
>> See:
>> http://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29
>> 
>> Try to clarify your requirements... or maybe min-should-match was all you 
>> needed?
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: Mark
>> Sent: Friday, August 02, 2013 7:50 PM
>> To: solr-user@lucene.apache.org
>> Subject: Percolate feature?
>> 
>> We have a set number of known terms we want to match against.
>> 
>> In Index:
>> "term one"
>> "term two"
>> "term three"
>> 
>> I know how to match all terms of a user query against the index but we would 
>> like to know how/if we can match a user's query against all the terms in the 
>> index?
>> 
>> Search Queries:
>> "my search term" => 0 matches
>> "my term search one" => 1 match  ("term one")
>> "some prefix term two" => 1 match ("term two")
>> "one two three" => 0 matches
>> 
>> I can only explain this is almost a reverse search???
>> 
>> I came across the following from ElasticSearch 
>> (http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds 
>> like this may accomplish the above but haven't tested. I was wondering if 
>> Solr had something similar or an alternative way of accomplishing this?
>> 
>> Thanks



Re: Percolate feature?

2013-08-08 Thread Mark
Ok forget the mention of percolate. 

We have a large list of known keywords we would like to match against. 

Product keyword:  "Sony"
Product keyword:  "Samsung Galaxy"

We would like to be able to detect, given a product title, whether or not it
matches any known keywords. For a keyword to be matched, all of its terms must
be present in the given product title.

Product Title: "Sony Experia"
Matches and returns a highlight: "Sony Experia"

Product Title: "Samsung 52inch LC"
Does not match

Product Title: "Samsung Galaxy S4"
Matches a returns a highlight: "Samsung Galaxy"

Product Title: "Galaxy Samsung S4"
Matches a returns a highlight: " Galaxy  Samsung"

What would be the best way to approach this? 
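
To pin down the matching rule independent of Solr: a keyword matches when
every one of its terms appears in the title, in any order. A naive in-memory
sketch of what I'm after (ignoring analysis, stemming and highlighting; the
class and variable names are just illustrative):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A title matches a keyword when the title contains ALL of the keyword's
// terms, in any order.
public class KeywordMatcher {
    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("Sony", "Samsung Galaxy");
        String[] titles = {
            "Sony Experia", "Samsung 52inch LC", "Samsung Galaxy S4", "Galaxy Samsung S4"
        };

        for (String title : titles) {
            Set<String> titleTerms =
                new HashSet<>(Arrays.asList(title.toLowerCase().split("\\s+")));
            for (String keyword : keywords) {
                List<String> keywordTerms =
                    Arrays.asList(keyword.toLowerCase().split("\\s+"));
                if (titleTerms.containsAll(keywordTerms)) {
                    System.out.println("\"" + title + "\" matches keyword \"" + keyword + "\"");
                }
            }
        }
    }
}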




On Aug 5, 2013, at 7:02 PM, Chris Hostetter  wrote:

> 
> : Subject: Percolate feature?
> 
> can you give a more concrete, realistic example of what you are trying to 
> do? your synthetic hypothetical example is kind of hard to make sense of.
> 
> your Subject line and comment that the "percolate" feature of elastic 
> search sounds like what you want seems to have some lead people down a 
> path of assuming you want to run these types of queries as documents are 
> indexed -- but that isn't at all clear to me from the way you worded your 
> question other then that.
> 
> it's also not clear what aspect of the "results" you really care about -- 
> are you only looking for the *number* of documents that "match" according 
> to your concept of matching, or are you looking for a list of matches?  
> what multiple documents have all of their terms in the query string -- how 
> should they score relative to eachother?  what if a document contains the 
> same term multiple times, do you expect it to be a match of a query only 
> if that term appears in the query multiple times as well?  do you care 
> about hte ordering of the terms in the query? the ordering of hte terms in 
> the document?
> 
> Ideally: describe for us what you wnat to do, w/o assuming 
> solr/elasticsearch/anything specific about the implementation -- just 
> describe your actual use case for us, with several real document/query 
> examples.
> 
> 
> 
> https://people.apache.org/~hossman/#xyproblem
> XY Problem
> 
> Your question appears to be an "XY Problem" ... that is: you are dealing
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue.  Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
> 
> 
> 
> 
> 
> 
> -Hoss



Re: Percolate feature?

2013-08-09 Thread Mark
> *All* of the terms in the field must be matched by the querynot 
> vice-versa.

Exactly. This is why I was trying to explain it as a reverse search.

I just realized I described it as a *large* list of known keywords when really
it's small; no more than 1000. Forgetting about performance, how hard do you
think this would be to implement? How should I even start?

Thanks for the input

On Aug 9, 2013, at 6:56 AM, Yonik Seeley  wrote:

> *All* of the terms in the field must be matched by the querynot 
> vice-versa.
> And no, we don't have a query for that out of the box.  To implement,
> it seems like it would require the total number of terms indexed for a
> field (for each document).
> I guess you could also index start and end tokens and then use query
> expansion to all possible combinations... messy though.
> 
> -Yonik
> http://lucidworks.com
> 
> On Fri, Aug 9, 2013 at 8:19 AM, Erick Erickson  
> wrote:
>> This _looks_ like simple phrase matching (no slop) and highlighting...
>> 
>> But whenever I think the answer is really simple, it usually means
>> that I'm missing something
>> 
>> Best
>> Erick
>> 
>> 
>> On Thu, Aug 8, 2013 at 11:18 PM, Mark  wrote:
>> 
>>> Ok forget the mention of percolate.
>>> 
>>> We have a large list of known keywords we would like to match against.
>>> 
>>> Product keyword:  "Sony"
>>> Product keyword:  "Samsung Galaxy"
>>> 
>>> We would like to be able to detect given a product title whether or not it
>>> matches any known keywords. For a keyword to be matched all of its terms
>>> must be present in the product title given.
>>> 
>>> Product Title: "Sony Experia"
>>> Matches and returns a highlight: "Sony Experia"
>>> 
>>> Product Title: "Samsung 52inch LC"
>>> Does not match
>>> 
>>> Product Title: "Samsung Galaxy S4"
>>> Matches and returns a highlight: "Samsung Galaxy"
>>> 
>>> Product Title: "Galaxy Samsung S4"
>>> Matches and returns a highlight: " Galaxy  Samsung"
>>> 
>>> What would be the best way to approach this?
>>> 
>>> 
>>> 
>>> 
>>> On Aug 5, 2013, at 7:02 PM, Chris Hostetter 
>>> wrote:
>>> 
>>>> 
>>>> : Subject: Percolate feature?
>>>> 
>>>> can you give a more concrete, realistic example of what you are trying to
>>>> do? your synthetic hypothetical example is kind of hard to make sense of.
>>>> 
>>>> your Subject line and comment that the "percolate" feature of elastic
>>>> search sounds like what you want seems to have led some people down a
>>>> path of assuming you want to run these types of queries as documents are
>>>> indexed -- but that isn't at all clear to me from the way you worded your
>>>> question other than that.
>>>> 
>>>> it's also not clear what aspect of the "results" you really care about --
>>>> are you only looking for the *number* of documents that "match" according
>>>> to your concept of matching, or are you looking for a list of matches?
>>>> what if multiple documents have all of their terms in the query string -- how
>>>> should they score relative to each other?  what if a document contains the
>>>> same term multiple times, do you expect it to be a match of a query only
>>>> if that term appears in the query multiple times as well?  do you care
>>>> about the ordering of the terms in the query? the ordering of the terms in
>>>> the document?
>>>> 
>>>> Ideally: describe for us what you want to do, w/o assuming
>>>> solr/elasticsearch/anything specific about the implementation -- just
>>>> describe your actual use case for us, with several real document/query
>>>> examples.
>>>> 
>>>> 
>>>> 
>>>> https://people.apache.org/~hossman/#xyproblem
>>>> XY Problem
>>>> 
>>>> Your question appears to be an "XY Problem" ... that is: you are dealing
>>>> with "X", you are assuming "Y" will help you, and you are asking about
>>> "Y"
>>>> without giving more details about the "X" so that we can understand the
>>>> full issue.  Perhaps the best solution doesn't involve "Y" at all?
>>>> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -Hoss
>>> 
>>> 



Re: Percolate feature?

2013-08-09 Thread Mark
I'll look into this. Thanks for the concrete example as I don't even know which 
classes to start to look at to implement such a feature.

On Aug 9, 2013, at 9:49 AM, Roman Chyla  wrote:

> On Fri, Aug 9, 2013 at 11:29 AM, Mark  wrote:
> 
>>> *All* of the terms in the field must be matched by the query... not
>> vice-versa.
>> 
>> Exactly. This is why I was trying to explain it as a reverse search.
>> 
>> I just realized I described it as a *large* list of known keywords when
>> really it's small; no more than 1000. Forgetting about performance, how hard
>> do you think this would be to implement? How should I even start?
>> 
> 
> not hard, index all terms into a field - make sure there are no duplicates,
> as you want to count them - then I can imagine at least two options: save
> the number of terms as a payload together with the terms, or in a second step
> (in a collector, for example), load the document and count the terms in
> the field - if they match the query size, you are done
> 
> a trivial, naive implementation (as you say 'forget performance') could be:
> 
> searcher.search(query, null, new Collector() {
>   ...
>   public void collect(int i) throws IOException {
>     // load just the keyword field of the matching doc
>     Document d = reader.document(i, fieldsToLoad);
>     // keep the doc only when every indexed term was matched, i.e. the
>     // number of stored values equals the number of clauses in the query
>     if (d.getValues(fieldToLoad).length == numberOfQueryClauses) {
>       pq.add(new ScoreDoc(i + docBase, score));
>     }
>   }
> });
> 
> so if your query contains no duplicates and all terms must match, you can
> be sure that you are collecting docs only when the number of terms matches
> the number of clauses in the query
> 
> roman
> 
> 
>> Thanks for the input
>> 
>> On Aug 9, 2013, at 6:56 AM, Yonik Seeley  wrote:
>> 
>>> *All* of the terms in the field must be matched by the query... not
>> vice-versa.
>>> And no, we don't have a query for that out of the box.  To implement,
>>> it seems like it would require the total number of terms indexed for a
>>> field (for each document).
>>> I guess you could also index start and end tokens and then use query
>>> expansion to all possible combinations... messy though.
>>> 
>>> -Yonik
>>> http://lucidworks.com
>>> 
>>> On Fri, Aug 9, 2013 at 8:19 AM, Erick Erickson 
>> wrote:
>>>> This _looks_ like simple phrase matching (no slop) and highlighting...
>>>> 
>>>> But whenever I think the answer is really simple, it usually means
>>>> that I'm missing something
>>>> 
>>>> Best
>>>> Erick
>>>> 
>>>> 
>>>> On Thu, Aug 8, 2013 at 11:18 PM, Mark 
>> wrote:
>>>> 
>>>>> Ok forget the mention of percolate.
>>>>> 
>>>>> We have a large list of known keywords we would like to match against.
>>>>> 
>>>>> Product keyword:  "Sony"
>>>>> Product keyword:  "Samsung Galaxy"
>>>>> 
>>>>> We would like to be able to detect given a product title whether or
>> not it
>>>>> matches any known keywords. For a keyword to be matched all of its
>> terms
>>>>> must be present in the product title given.
>>>>> 
>>>>> Product Title: "Sony Experia"
>>>>> Matches and returns a highlight: "Sony Experia"
>>>>> 
>>>>> Product Title: "Samsung 52inch LC"
>>>>> Does not match
>>>>> 
>>>>> Product Title: "Samsung Galaxy S4"
>>>>> Matches and returns a highlight: "Samsung Galaxy"
>>>>> 
>>>>> Product Title: "Galaxy Samsung S4"
>>>>> Matches and returns a highlight: " Galaxy  Samsung"
>>>>> 
>>>>> What would be the best way to approach this?
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Aug 5, 2013, at 7:02 PM, Chris Hostetter 
>>>>> wrote:
>>>>> 
>>>>>> 
>>>>>> : Subject: Percolate feature?
>>>>>> 
>>>>>> can you give a more concrete, realistic example of what you are
>> trying to
>>>>>> do? your synthetic hypothetical example is kind of hard to make sense
>> of.
>>>>>> 
>>>>>> your Subject line and comment that the "percolate" feature of elastic
>>>>>> search sounds like what you want seems to have some lead people down a

Re: Percolate feature?

2013-08-10 Thread Mark
> So to reiterate your examples from before, but change the "labels" a 
> bit and add some more converse examples (and ignore the "highlighting" 
> aspect for a moment...
> 
> doc1 = "Sony"
> doc2 = "Samsung Galaxy"
> doc3 = "Sony Playstation"
> 
> queryA = "Sony Experia"   ... matches only doc1
> queryB = "Sony Playstation 3" ... matches doc3 and doc1
> queryC = "Samsung 52inch LC"  ... doesn't match anything
> queryD = "Samsung Galaxy S4"  ... matches doc2
> queryE = "Galaxy Samsung S4"  ... matches doc2
> 
> 
> ...do i still have that correct?

Yes

> 2) if you *do* care about using non-trivial analysis, then you can't use 
> the simple "termfreq()" function, which deals with raw terms -- in stead 
> you have to use the "query()" function to ensure that the input is parsed 
> appropriately -- but then you have to wrap that function in something that 
> will normalize the scores - so in place of termfreq('words','Galaxy') 
> you'd want something like...


Yes we will be using non-trivial analysis. Now here's another twist… what if we 
don't care about scoring?


Let's talk about the real use case. We are a marketplace that sells products that 
users have listed. For certain popular, high risk or restricted keywords we 
charge the seller an extra fee/ban the listing. We now have sellers purposely 
misspelling their listings to circumvent this fee. They will start adding 
suffixes to their product listings such as "Sonies" knowing that it gets 
indexed down to "Sony" and thus matching a user's query for Sony. Or they will 
munge together numbers and products… "2013Sony". Same thing goes for adding 
crazy non-ascii characters to the front of the keyword "ΒSony". This is 
obviously a problem because we aren't charging for these keywords and more 
importantly it makes our search results look like shit. 

We would like to:

1) Detect when a certain keyword is in a product title at listing time so we 
may charge the seller. This was my idea of a "reverse search", although it sounds 
like I may have caused too much confusion with that term.
2) Attempt to autocorrect these titles, hence the need for highlighting so we 
can try and replace the terms… this of course is done outside of Solr via an 
external service.

Since we do some stemming (KStemmer) and filtering (WordDelimiterFilterFactory) 
this makes conventional approaches such as regex quite troublesome. Regex is 
also quite slow and scales horribly and always needs to be in lockstep with 
schema changes.

Now knowing this, is there a good way to approach this?

Thanks


On Aug 9, 2013, at 11:56 AM, Chris Hostetter  wrote:

> 
> : I'll look into this. Thanks for the concrete example as I don't even 
> : know which classes to start to look at to implement such a feature.
> 
> Either roman isn't understanding what you are asking for, or i'm not -- 
> but i don't think what roman described will work for you...
> 
> : > so if your query contains no duplicates and all terms must match, you can
> : > be sure that you are collecting docs only when the number of terms matches
> : > number of clauses in the query
> 
> several of the examples you gave did not match what Roman is describing, 
> as i understand it.  Most people on this thread seem to be getting 
> confused by having their perceptions "flipped" about what your "data known 
> in advance" is vs the "data you get at request time".
> 
> You described this...
> 
> : > Product keyword:  "Sony"
> : > Product keyword:  "Samsung Galaxy"
> : > 
> : > We would like to be able to detect given a product title whether or
> : >> not it
> : > matches any known keywords. For a keyword to be matched all of its
> : >> terms
> : > must be present in the product title given.
> : > 
> : > Product Title: "Sony Experia"
> : > Matches and returns a highlight: "Sony Experia"
> 
> ...suggesting that what you call "product keywords" are the "data you know 
> about in advance" and "product titles" are the data you get at request 
> time.
> 
> So your example of the "request time" input (ie: query) "Sony Experia" 
> matching "data known in advance (ie: indexed document) "Sony" would not 
> work with Roman's example.
> 
> To rephrase (what i think i understand is) your goal...
> 
> * you have many (10^3+) documents known in advance
> * any document D contains a set of words W(D) of varying sizes
> * any request Q contains a set of words W(Q) of varying sizes
> * you want a given request Q to match a document D if and only if:
>   - W(D) is a subset of W(Q)
>   - ie: no item exists in W(D) that does not exist in W(Q)
>   - ie: any number of items may exist in W(Q) that are not in W(D)
> 
> So to reiterate your examples from before, but change the "labels" a 
> bit and add some more converse examples (and ignore the "highlighting" 
> aspect for a moment...
> 
> doc1 = "Sony"
> doc2 = "Samsung Galaxy"
> doc3 = "Sony Playstation"
> 
> queryA = "Sony Experia"   ... matches only doc1
> queryB = "Sony Playsta
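
For reference, the "reverse search at listing time" workflow described above can 
be prototyped from the client side before any custom query type exists: index the 
new listing, then run each known keyword (only ~1000 of them) as an 
all-terms-required query restricted to that one listing, with highlighting for the 
autocorrect step. A sketch only -- SolrJ 4.x is assumed, and the core, field and 
id names are illustrative.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ListingKeywordCheck {

  // Returns the known keywords whose terms are all present in the (already
  // indexed) listing's title, using the same analysis chain as search.
  static List<String> flaggedKeywords(HttpSolrServer solr, String listingId,
                                      List<String> knownKeywords) throws SolrServerException {
    List<String> flagged = new ArrayList<String>();
    for (String keyword : knownKeywords) {
      SolrQuery q = new SolrQuery(keyword);
      q.set("defType", "edismax");
      q.set("qf", "title");
      q.set("mm", "100%");                   // every keyword term must match
      q.addFilterQuery("id:" + listingId);   // only the listing just indexed
      q.setRows(1);
      q.setHighlight(true);
      q.addHighlightField("title");          // highlights feed the autocorrect step
      QueryResponse rsp = solr.query(q);
      if (rsp.getResults().getNumFound() > 0) {
        flagged.add(keyword);
      }
    }
    return flagged;
  }
}

The highlighting section of each QueryResponse (getHighlighting()) then points at 
the title tokens that triggered the match, which is what an external autocorrect 
service needs.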

Re: Percolate feature?

2013-08-10 Thread Mark
Our schema is pretty basic.. nothing fancy going on here


  






  
   







  



On Aug 10, 2013, at 3:40 PM, "Jack Krupansky"  wrote:

> Now we're getting somewhere!
> 
> To (over-simplify), you simply want to know if a given "listing" would match 
> a high-value pattern, either in a "clean" manner (obvious keywords) or in an 
> "unclean" manner (e.g., fuzzy keyword matching, stemming, n-grams.)
> 
> To a large extent this also depends on how rich and powerful your end-user query 
> support is. So, if the user searches for "sony", "samsung", or "apple", will 
> it match some oddball listing that fuzzily matches those terms?
> 
> So... tell us, how rich your query interface is. I mean, do you support 
> wildcard, fuzzy query, ngrams (e.g., can they type "son" or "sam" or "app", 
> or... will "sony" match "sonblah-blah")?
> 
> Reverse-search may in fact be what you need in this case since you literally 
> do mean "if I index this document, will it match any of these queries" (but 
> doesn't score a hit on your direct check for whether it is a clean keyword 
> match.)
> 
> In your previous examples you only gave clean product titles, not examples of 
> circumventions of simple keyword matches.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Mark
> Sent: Saturday, August 10, 2013 6:24 PM
> To: solr-user@lucene.apache.org
> Cc: Chris Hostetter
> Subject: Re: Percolate feature?
> 
>> So to reiterate your examples from before, but change the "labels" a
>> bit and add some more converse examples (and ignore the "highlighting"
>> aspect for a moment...
>> 
>> doc1 = "Sony"
>> doc2 = "Samsung Galaxy"
>> doc3 = "Sony Playstation"
>> 
>> queryA = "Sony Experia"   ... matches only doc1
>> queryB = "Sony Playstation 3" ... matches doc3 and doc1
>> queryC = "Samsung 52inch LC"  ... doesn't match anything
>> queryD = "Samsung Galaxy S4"  ... matches doc2
>> queryE = "Galaxy Samsung S4"  ... matches doc2
>> 
>> 
>> ...do i still have that correct?
> 
> Yes
> 
>> 2) if you *do* care about using non-trivial analysis, then you can't use
>> the simple "termfreq()" function, which deals with raw terms -- instead
>> you have to use the "query()" function to ensure that the input is parsed
>> appropriately -- but then you have to wrap that function in something that
>> will normalize the scores - so in place of termfreq('words','Galaxy')
>> you'd want something like...
> 
> 
> Yes we will be using non-trivial analysis. Now here's another twist… what if 
> we don't care about scoring?
> 
> 
> Let's talk about the real use case. We are a marketplace that sells products 
> that users have listed. For certain popular, high risk or restricted keywords 
> we charge the seller an extra fee/ban the listing. We now have sellers 
> purposely misspelling their listings to circumvent this fee. They will start 
> adding suffixes to their product listings such as "Sonies" knowing that it 
> gets indexed down to "Sony" and thus matching a user's query for Sony. Or they 
> will munge together numbers and products… "2013Sony". Same thing goes for 
> adding crazy non-ascii characters to the front of the keyword "ΒSony". This 
> is obviously a problem because we aren't charging for these keywords and more 
> importantly it makes our search results look like shit.
> 
> We would like to:
> 
> 1) Detect when a certain keyword is in a product title at listing time so we 
> may charge the seller. This was my idea of a "reverse search", although it sounds 
> like I may have caused too much confusion with that term.
> 2) Attempt to autocorrect these titles, hence the need for highlighting so we 
> can try and replace the terms… this of course is done outside of Solr via an 
> external service.
> 
> Since we do some stemming (KStemmer) and filtering 
> (WordDelimiterFilterFactory) this makes conventional approaches such as regex 
> quite troublesome. Regex is also quite slow and scales horribly and always 
> needs to be in lockstep with schema changes.
> 
> Now knowing this, is there a good way to approach this?
> 
> Thanks
> 
> 
> On Aug 9, 2013, at 11:56 AM, Chris Hostetter  wrote:
> 
>> 
>> : I'll look into t

Re: Percolate feature?

2013-08-13 Thread Mark
Any ideas?

On Aug 10, 2013, at 6:28 PM, Mark  wrote:

> Our schema is pretty basic.. nothing fancy going on here
> 
> 
>  
>
>
> protected="protected.txt"/>
> generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" 
> preserveOriginal="1"/>
>
>
>  
>   
>
>
> ignoreCase="true" expand="true"/>
> protected="protected.txt"/>
> generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" 
> preserveOriginal="1"/>
>
>
>  
>
> 
> 
> On Aug 10, 2013, at 3:40 PM, "Jack Krupansky"  wrote:
> 
>> Now we're getting somewhere!
>> 
>> To (over-simplify), you simply want to know if a given "listing" would match 
>> a high-value pattern, either in a "clean" manner (obvious keywords) or in an 
>> "unclean" manner (e.g., fuzzy keyword matching, stemming, n-grams.)
>> 
>> To a large extent this also depends on how rich and powerful your end-user query 
>> support is. So, if the user searches for "sony", "samsung", or "apple", will 
>> it match some oddball listing that fuzzily matches those terms?
>> 
>> So... tell us, how rich your query interface is. I mean, do you support 
>> wildcard, fuzzy query, ngrams (e.g., can they type "son" or "sam" or "app", 
>> or... will "sony" match "sonblah-blah")?
>> 
>> Reverse-search may in fact be what you need in this case since you literally 
>> do mean "if I index this document, will it match any of these queries" (but 
>> doesn't score a hit on your direct check for whether it is a clean keyword 
>> match.)
>> 
>> In your previous examples you only gave clean product titles, not examples 
>> of circumventions of simple keyword matches.
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: Mark
>> Sent: Saturday, August 10, 2013 6:24 PM
>> To: solr-user@lucene.apache.org
>> Cc: Chris Hostetter
>> Subject: Re: Percolate feature?
>> 
>>> So to reiterate your examples from before, but change the "labels" a
>>> bit and add some more converse examples (and ignore the "highlighting"
>>> aspect for a moment...
>>> 
>>> doc1 = "Sony"
>>> doc2 = "Samsung Galaxy"
>>> doc3 = "Sony Playstation"
>>> 
>>> queryA = "Sony Experia"   ... matches only doc1
>>> queryB = "Sony Playstation 3" ... matches doc3 and doc1
>>> queryC = "Samsung 52inch LC"  ... doesn't match anything
>>> queryD = "Samsung Galaxy S4"  ... matches doc2
>>> queryE = "Galaxy Samsung S4"  ... matches doc2
>>> 
>>> 
>>> ...do i still have that correct?
>> 
>> Yes
>> 
>>> 2) if you *do* care about using non-trivial analysis, then you can't use
>>> the simple "termfreq()" function, which deals with raw terms -- instead
>>> you have to use the "query()" function to ensure that the input is parsed
>>> appropriately -- but then you have to wrap that function in something that
>>> will normalize the scores - so in place of termfreq('words','Galaxy')
>>> you'd want something like...
>> 
>> 
>> Yes we will be using non-trivial analysis. Now here's another twist… what if 
>> we don't care about scoring?
>> 
>> 
>> Let's talk about the real use case. We are a marketplace that sells products 
>> that users have listed. For certain popular, high risk or restricted 
>> keywords we charge the seller an extra fee/ban the listing. We now have 
>> sellers purposely misspelling their listings to circumvent this fee. They 
>> will start adding suffixes to their product listings such as "Sonies" 
>> knowing that it gets indexed down to "Sony" and thus matching a user's query 
>> for Sony. Or they will munge together numbers and products… "2013Sony". Same 
>> thing goes for adding crazy non-ascii characters to the front of the keyword 
>> "Î’Sony". This is obviously a problem because we aren't charging for these 
>> keywords and more importantly it makes our search results look like shit.
>> 
>> We would like to:
&g

App server?

2013-10-02 Thread Mark
Is Jetty sufficient for running Solr or should I go with something a little 
more enterprise-grade like Tomcat?

Any others?

SolrJ best pratices

2013-10-07 Thread Mark
Are there any links describing best practices for interacting with SolrJ? I've 
checked the wiki and it seems woefully incomplete: 
(http://wiki.apache.org/solr/Solrj)

Some specific questions:
- When working with HttpSolrServer should we keep instances around forever or 
should we create a singleton that can/should be used over and over? 
- Is there a way to change the collection after creating the server or do we 
need to create a new server for each collection?
-..





Bootstrapping / Full Importing using Solr Cloud

2013-10-08 Thread Mark
We are in the process of upgrading our Solr cluster to the latest and greatest 
Solr Cloud. I have some questions regarding full indexing though. We're 
currently running a long job (~30 hours) using DIH to do a full index on over 
10M products. This process consumes a lot of memory and, while updating, cannot 
handle any user requests. 

How, or what would be the best way going about this when using Solr Cloud? 
First off, does DIH work with cloud? Would I need to separate out my DIH 
indexing machine from the machines serving up user requests? If not going down 
the DIH route, what are my best options (solrj?) 

Thanks for the input

Re: SolrJ best pratices

2013-10-09 Thread Mark
Thanks for the clarification.

In Solr Cloud just use 1 connection. In non-cloud environments you will need 
one per core.



On Oct 8, 2013, at 5:58 PM, Shawn Heisey  wrote:

> On 10/7/2013 3:08 PM, Mark wrote:
>> Some specific questions:
>> - When working with HttpSolrServer should we keep around instances for ever 
>> or should we create a singleton that can/should be used over and over?
>> - Is there a way to change the collection after creating the server or do we 
>> need to create a new server for each collection?
> 
> If at all possible, you should create your server object and use it for the 
> life of your application.  SolrJ is threadsafe.  If there is any part of it 
> that's not, the javadocs should say so - the SolrServer implementations 
> definitely are.
> 
> By using the word "collection" you are implying that you are using SolrCloud 
> ... but earlier you said HttpSolrServer, which implies that you are NOT using 
> SolrCloud.
> 
> With HttpSolrServer, your base URL includes the core or collection name - 
> "http://server:port/solr/corename"; for example.  Generally you will need one 
> object for each core/collection, and another object for server-level things 
> like CoreAdmin.
> 
> With SolrCloud, you should be using CloudSolrServer instead, another 
> implementation of SolrServer that is constantly aware of the SolrCloud 
> clusterstate.  With that object, you can use setDefaultCollection, and you 
> can also add a "collection" parameter to each SolrQuery or other request 
> object.
> 
> Thanks,
> Shawn
> 
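
A sketch of the two patterns described above (SolrJ 4.x assumed; URLs, core and 
collection names are illustrative). Both server objects are thread-safe and meant 
to be created once and reused for the life of the application:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SolrClients {
  public static void main(String[] args) throws Exception {
    // Non-cloud: one HttpSolrServer per core/collection -- the base URL
    // includes the core name.
    HttpSolrServer products = new HttpSolrServer("http://solr1:8983/solr/products");

    // SolrCloud: one CloudSolrServer pointed at ZooKeeper; set a default
    // collection, or pass "collection" per request.
    CloudSolrServer cloud = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    cloud.setDefaultCollection("products");

    SolrQuery q = new SolrQuery("sony");
    q.set("collection", "searchlogs");   // overrides the default for this request only
    QueryResponse rsp = cloud.query(q);
    System.out.println(rsp.getResults().getNumFound());

    products.shutdown();
    cloud.shutdown();
  }
}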



Setting SolrCloudServer collection

2013-10-11 Thread Mark
If using one static SolrCloudServer how can I add a bean to a certain 
collection? Do I need to update setDefaultCollection() each time? I doubt that's 
thread safe.

Thanks
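
One way to avoid flipping setDefaultCollection() back and forth is to put the 
collection on the request itself rather than on the server object. A sketch 
(SolrJ 4.x assumed; the bean is any class annotated with @Field, and the exact 
request parameter handling is worth verifying against your SolrJ version):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;

public class AddBeanToCollection {
  public static void addBean(CloudSolrServer cloud, Object bean, String collection) throws Exception {
    UpdateRequest req = new UpdateRequest();
    // convert the annotated bean to a SolrInputDocument
    req.add(cloud.getBinder().toSolrInputDocument(bean));
    // route this request to a specific collection, leaving the default alone
    req.setParam("collection", collection);
    req.setCommitWithin(10000);
    req.process(cloud);
  }
}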

Re: DIH importing

2011-08-29 Thread Mark

Thanks, I'll give that a try

On 8/26/11 9:54 AM, simon wrote:

It sounds as though you are optimizing the index after the delta import. If
you don't do that, then only new segments will be replicated and syncing
will be much faster.


On Fri, Aug 26, 2011 at 12:08 PM, Mark  wrote:


We are currently delta-importing using DIH after which all of our servers
have to download the full index (16G). This obviously puts quite a strain on
our slaves while they are syncing over the index. Is there any way not to
sync over the whole index, but rather just the parts that have changed?

We would like to get to the point where are no longer using DIH but rather
we are constantly sending documents over HTTP to our master in realtime. We
would then like our slaves to download these changes as soon as possible. Is
something like this even possible?

Thanks for your help



Searching multiple fields

2011-09-26 Thread Mark
I have a use case where I would like to search across two fields but I 
do not want to weight a document that has a match in both fields higher 
than a document that has a match in only 1 field.


For example.

Document 1
 - Field A: "Foo Bar"
 - Field B: "Foo Baz"

Document 2
 - Field A: "Foo Blarg"
 - Field B: "Something else"

Now when I search for "Foo" I would like document 1 and 2 to be 
similarly scored; however, document 1 will be scored much higher in this 
use case because it matches in both fields. I could create a third field 
and use copyField directive to search across that but I was wondering if 
there is an alternative way. It would be nice if we could search across 
some sort of "virtual field" that will use both underlying fields but 
not actually increase the size of the index.


Thanks


Re: Searching multiple fields

2011-09-27 Thread Mark
I thought that a similarity class will only affect the scoring of a 
single field.. not across multiple fields? Can anyone else chime in with 
some input? Thanks.


On 9/26/11 9:02 PM, Otis Gospodnetic wrote:

Hi Mark,

Eh, I don't have Lucene/Solr source code handy, but I *think* for that you'd 
need to write custom Lucene similarity.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



____
From: Mark
To: solr-user@lucene.apache.org
Sent: Monday, September 26, 2011 8:12 PM
Subject: Searching multiple fields

I have a use case where I would like to search across two fields but I do not 
want to weight a document that has a match in both fields higher than a 
document that has a match in only 1 field.

For example.

Document 1
- Field A: "Foo Bar"
- Field B: "Foo Baz"

Document 2
- Field A: "Foo Blarg"
- Field B: "Something else"

Now when I search for "Foo" I would like document 1 and 2 to be similarly scored however 
document 1 will be scored much higher in this use case because it matches in both fields. I could 
create a third field and use copyField directive to search across that but I was wondering if there 
is an alternative way. It would be nice if we could search across some sort of "virtual 
field" that will use both underlying fields but not actually increase the size of the index.

Thanks





HBase Datasource

2011-11-10 Thread Mark
Has anyone had any success/experience with building a HBase datasource 
for DIH? Are there any solutions available on the web?


Thanks.


CachedSqlEntityProcessor

2011-11-15 Thread Mark
I am trying to use the CachedSqlEntityProcessor with Solr 1.4.2 however 
I am not seeing any performance gains. I've read some other posts that 
reference cacheKey and cacheLookup however I don't see any reference to 
them in the wiki

http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor

Can someone please clarify?

Thanks



Re: CachedSqlEntityProcessor

2011-11-15 Thread Mark

FYI my sub-entity looks like the following



On 11/15/11 10:42 AM, Mark wrote:
I am trying to use the CachedSqlEntityProcessor with Solr 1.4.2 
however I am not seeing any performance gains. I've read some other 
posts that reference cacheKey and cacheLookup however I don't see any 
reference to them in the wiki

http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor

Can someone please clarify?

Thanks



Multithreaded DIH bug

2011-12-01 Thread Mark
I'm trying to use multiple threads with DIH but I keep receiving the 
following error.. "Operation not allowed after ResultSet closed"


Is there any way I can fix this?

Dec 1, 2011 4:38:47 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.RuntimeException: Error in 
multi-threaded import
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: 
org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.sql.SQLException: Operation not allowed after ResultSet closed
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:339)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$600(JdbcDataSource.java:228)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:262)
at 
org.apache.solr.handler.dataimport.CachedSqlEntityProcessor.getAllNonCachedRows(CachedSqlEntityProcessor.java:72)
at 
org.apache.solr.handler.dataimport.EntityProcessorBase.getIdCacheData(EntityProcessorBase.java:201)
at 
org.apache.solr.handler.dataimport.CachedSqlEntityProcessor.nextRow(CachedSqlEntityProcessor.java:60)
at 
org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper.nextRow(ThreadedEntityProcessorWrapper.java:84)
at 
org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:449)
at 
org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.run(DocBuilder.java:402)
at 
org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:469)
at 
org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.access$000(DocBuilder.java:356)
at 
org.apache.solr.handler.dataimport.DocBuilder$EntityRunner$1.run(DocBuilder.java:409)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)

at java.lang.Thread.run(Thread.java:636)
Caused by: java.sql.SQLException: Operation not allowed after ResultSet 
closed

at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
at com.mysql.jdbc.ResultSetImpl.checkClosed(ResultSetImpl.java:794)
at com.mysql.jdbc.ResultSetImpl.next(ResultSetImpl.java:7139)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:331)

... 14 more



Re: Multithreaded DIH bug

2011-12-02 Thread Mark

Thanks for the info

On 12/2/11 1:29 AM, Mikhail Khludnev wrote:

Hello,

AFAIK Particularly this exception is not a big deal. It's just one of the
evidence of the fact that CachedSqlEntityProcessor doesn't work in multiple
threads at 3.x and 4.0. It's discussed at
http://search-lucene.com/m/0DNn32L2UBv

the main problem here is the following messages in the log

org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper.nextRow
arow : null

Some time ago I did the patch for 3.4 (pretty raw) you can try it
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201110.mbox/browser
I plan (but only plan, sorry) to address it at 4.0 where SOLR-2382
refactoring has been applied recently.

Regards

On Fri, Dec 2, 2011 at 4:57 AM, Mark  wrote:


I'm trying to use multiple threads with DIH but I keep receiving the
following error.. "Operation not allowed after ResultSet closed"

Is there anyway I can fix this?

Dec 1, 2011 4:38:47 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.RuntimeException: Error in multi-threaded import
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.sql.SQLException: Operation not allowed after ResultSet closed
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:339)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$600(JdbcDataSource.java:228)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:262)
at org.apache.solr.handler.dataimport.CachedSqlEntityProcessor.getAllNonCachedRows(CachedSqlEntityProcessor.java:72)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getIdCacheData(EntityProcessorBase.java:201)
at org.apache.solr.handler.dataimport.CachedSqlEntityProcessor.nextRow(CachedSqlEntityProcessor.java:60)
at org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper.nextRow(ThreadedEntityProcessorWrapper.java:84)
at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:449)
at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.run(DocBuilder.java:402)
at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:469)
at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.access$000(DocBuilder.java:356)
at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner$1.run(DocBuilder.java:409)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
Caused by: java.sql.SQLException: Operation not allowed after ResultSet closed
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
at com.mysql.jdbc.ResultSetImpl.checkClosed(ResultSetImpl.java:794)
at com.mysql.jdbc.ResultSetImpl.next(ResultSetImpl.java:7139)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:331)
... 14 more






Question on DIH delta imports

2011-12-05 Thread Mark
*pk*: The primary key for the entity. It is *optional* and only needed 
when using delta-imports. It has no relation to the uniqueKey defined in 
schema.xml but they both can be the same.


When using it in a nested entity, is the pk the primary key column of the 
joined table or the key used for joining? For example:


Say I have table foo whose primary key is id and table sub_foo whose 
primary key is also id. Table sub_foo also has a column named foo_id 
that is used for joining foo and sub_foo.


Should it be:

 
 
 

Or:

 

 


Thanks for your help




Re: Question on DIH delta imports

2011-12-06 Thread Mark

Anyone?

On 12/5/11 11:04 AM, Mark wrote:
*pk*: The primary key for the entity. It is *optional* and only needed 
when using delta-imports. It has no relation to the uniqueKey defined 
in schema.xml but they both can be the same.


When using it in a nested entity, is the pk the primary key column of the 
joined table or the key used for joining? For example:


Say I have table foo whose primary key is id and table sub_foo 
whose primary key is also id. Table sub_foo also has a column 
named foo_id that is used for joining foo and sub_foo.


Should it be:
  
  
  
Or:
  
 
  

Thanks for your help




Design questions/Schema Help

2010-07-26 Thread Mark
We are thinking about using Cassandra to store our search logs. Can 
someone point me in the right direction/lend some guidance on design? I 
am new to Cassandra and I am having trouble wrapping my head around some 
of these new concepts. My brain keeps wanting to go back to a RDBMS design.


We will be storing the user query, # of hits returned and their session 
id. We would like to be able to answer the following questions.


- What are the n most popular queries and their counts within the last x 
(mins/hours/days/etc). Basically the most popular searches within a 
given time range.
- What is the most popular query within the last x where hits = 0. Same 
as above but with an extra "where" clause

- For session id x give me all their other queries
- What are all the session ids that searched for 'foos'

We accomplish the above functionality w/ MySQL using 2 tables. One for 
the raw search log information and the other to keep the 
aggregate/running counts of queries.


Would this sort of ad-hoc querying be better implemented using Hadoop + 
Hive? If so, should I be storing all this information in Cassandra then 
using Hadoop to retrieve it?


Thanks for your suggestions


Re: Design questions/Schema Help

2010-07-26 Thread Mark

On 7/26/10 4:43 PM, Mark wrote:
We are thinking about using Cassandra to store our search logs. Can 
someone point me in the right direction/lend some guidance on design? 
I am new to Cassandra and I am having trouble wrapping my head around 
some of these new concepts. My brain keeps wanting to go back to a 
RDBMS design.


We will be storing the user query, # of hits returned and their 
session id. We would like to be able to answer the following questions.


- What are the n most popular queries and their counts within the last 
x (mins/hours/days/etc). Basically the most popular searches within a 
given time range.
- What is the most popular query within the last x where hits = 0. 
Same as above but with an extra "where" clause

- For session id x give me all their other queries
- What are all the session ids that searched for 'foos'

We accomplish the above functionality w/ MySQL using 2 tables. One for 
the raw search log information and the other to keep the 
aggregate/running counts of queries.


Would this sort of ad-hoc querying be better implemented using Hadoop 
+ Hive? If so, should I be storing all this information in Cassandra 
then using Hadoop to retrieve it?


Thanks for your suggestions 

Whoops wrong forum


Solr crawls during replication

2010-07-26 Thread Mark
We have an index around 25-30G w/ 1 master and 5 slaves. We perform 
replication every 30 mins. During replication the disk I/O obviously 
shoots up on the slaves to the point where all requests routed to that 
slave take a really long time... sometimes to the point of timing out.


Are there any logical or physical changes we could make to our 
architecture to overcome this problem?


Thanks


DIH and Cassandra

2010-08-04 Thread Mark
Is it possible to use DIH with Cassandra either out of the box or with 
something more custom? Thanks


Throttling replication

2010-09-02 Thread Mark
 Is there any way or forthcoming patch that would allow configuration 
of how much network bandwidth (and ultimately disk I/O) a slave is 
allowed during replication? We have the current problem that while 
replicating our disk I/O goes through the roof. I would much rather have 
the replication take 2x as long with half the disk I/O. Any thoughts?


Thanks


Re: Solr crawls during replication

2010-09-02 Thread Mark

 On 8/6/10 5:03 PM, Chris Hostetter wrote:

: We have an index around 25-30G w/ 1 master and 5 slaves. We perform
: replication every 30 mins. During replication the disk I/O obviously shoots up
: on the slaves to the point where all requests routed to that slave take a
: really long time... sometimes to the point of timing out.
:
: Is there any logical or physical changes we could make to our architecture to
: overcome this problem?

If the problem really is disk I/O then perhaps you don't have enough RAM
set aside for the filesystem cache to keep the "current" index in memory?

I've seen people have this type of problem before, but usually it's
network I/O that is the bottleneck, in which case using multiple NICs on
your slaves (one for client requests, one for replication) can help.

I think at one point there was also talk about leveraging an rsync option
to force snappuller to throttle itself and only use a max amount of
bandwidth -- but then we moved away from script based replication to java
based replication and i don't think the Java Network/IO system supports
that type of throttling.  However: you might be able to configure it in
your switches/routers (ie: only let the slaves use X% of their total
bandwidth to talk to the master)


-Hoss

Thanks for the suggestions. Our slaves have 12G with 10G dedicated to 
the JVM.. too much?


Are the rsync snappuller features still available in 1.4.1? I may try 
that to see if it helps. Configuration of the switches may also be possible.


Also, would you mind explaining your second point... using dual NIC 
cards. How can this be accomplished/configured? Thanks for your help


Re: Throttling replication

2010-09-02 Thread Mark

 On 9/2/10 8:27 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:

There is no way to currently throttle replication. It consumes the
whole bandwidth available. It is a nice to have feature

On Thu, Sep 2, 2010 at 8:11 PM, Mark  wrote:

  Is there any way or forthcoming patch that would allow configuration of how
much network bandwidth (and ultimately disk I/O) a slave is allowed during
replication? We have the current problem of while replicating our disk I/O
goes through the roof. I would much rather have the replication take 2x as
long with half the disk I/O? Any thoughts?

Thanks




Do you mean that consuming the whole bandwidth available is a nice-to-have 
feature, or the ability to throttle the bandwidth?


Re: Throttling replication

2010-09-02 Thread Mark

 On 9/2/10 10:21 AM, Brandon Evans wrote:
Are you using rsync replication or the built in replication available 
in solr 1.4?  I have a patch that easily allows the --bwlimit 
option to be added to the rsyncd command line.


Either way I agree that a way to throttle the replication bandwidth 
would be nice.


-brandon

On 9/2/10 7:41 AM, Mark wrote:

Is there any way or forthcoming patch that would allow configuration of
how much network bandwidth (and ultimately disk I/O) a slave is allowed
during replication? We have the current problem of while replicating our
disk I/O goes through the roof. I would much rather have the replication
take 2x as long with half the disk I/O? Any thoughts?

Thanks
I am using the built in replication. Can you send me a link to the patch 
so I can give it a try? Thanks


Re: Solr crawls during replication

2010-09-03 Thread Mark

 On 9/3/10 11:37 AM, Jonathan Rochkind wrote:

Is the OS disk cache something you configure, or something the OS just does 
automatically based on available free RAM?  Or does it depend on the exact OS?  
Thinking about the OS disk cache is new to me. Thanks for any tips.

From: Shawn Heisey [s...@elyograg.org]
Sent: Friday, September 03, 2010 1:46 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr crawls during replication

   On 9/2/2010 9:31 AM, Mark wrote:

Thanks for the suggestions. Our slaves have 12G with 10G dedicated to
the JVM.. too much?

Are the rsync snappuller features still available in 1.4.1? I may try
that to see if it helps. Configuration of the switches may also be possible.

Also, would you mind explaining your second point... using dual NIC
cards. How can this be accomplished/configured? Thanks for your help

I will first admit that I am a relative newbie at this whole thing, so
find yourself a grain of salt before you read further ...

While it's probably not a bad idea to change to an rsync method and
implement bandwidth throttling, I'm betting the real root of your issue
is that you're low on memory, making your disk cache too small.  When
you do a replication, the simple act of copying the data shoves the
current index completely out of RAM, so when you do a query, it has to
go back to the disk (which is now VERY busy) to satisfy it.

Unless you know for sure that you need 10GB dedicated to the JVM, go
with much smaller values, because out of the 12GB available, that will
only leave you about 1.5GB, assuming the machine has no GUI and no other
processes.  If you need the JVM that large because you have very large
Solr caches, consider reducing their size dramatically.  In deciding
whether to use precious memory for the OS disk cache or Solr caches, the
OS should go first.  Additionally, If you have large Solr caches with a
small disk cache and configure large autowarm counts, you end up with
extremely long commit times.

I don't know how the 30GB of data in your index is distributed among the
various Lucene files, but for an index that size, I'd want to have
between 8GB and 16GB of RAM available to the OS just for disk caching,
and if more is possible, even better.  If you could get more than 32GB
of RAM in the server, your entire index would fit, and it would be very
fast.

With a little research, I came up (on my own) with what I think is a
decent rule of thumb, and I'm curious what the experts think of this
idea:  Find out how much space is taken by the index files with the
following extensions: fnm, fdx, frq, nrm, tii, tis, and tvx.  Think of
that as a bare minimum disk cache size, then shoot for between 1.5 and 3
times that value for your disk cache, so it can also cache parts of the
other files.

Thanks,
Shawn


Ditto on that question
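
As a rough illustration of the rule of thumb quoted above -- sum the sizes of 
those index file extensions, then aim for roughly 1.5x to 3x that much RAM left 
free for the OS disk cache -- something like this (the index path is an 
assumption):

import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DiskCacheEstimate {
  public static void main(String[] args) {
    Set<String> exts = new HashSet<String>(
        Arrays.asList("fnm", "fdx", "frq", "nrm", "tii", "tis", "tvx"));
    long bytes = 0;
    File[] files = new File("/var/solr/data/index").listFiles();
    if (files != null) {
      for (File f : files) {
        String name = f.getName();
        String ext = name.substring(name.lastIndexOf('.') + 1);
        if (exts.contains(ext)) {
          bytes += f.length();
        }
      }
    }
    double gb = bytes / (1024.0 * 1024.0 * 1024.0);
    System.out.printf("bare minimum disk cache: %.1f GB, suggested: %.1f - %.1f GB%n",
        gb, 1.5 * gb, 3.0 * gb);
  }
}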


Solr Cloud Architecture and DIH

2012-12-19 Thread Mark
We're currently running Solr 3.5 and our indexing process works as follows:  

We have a master that has a cron job to run a delta import via DIH every 5 
minutes. The delta-import takes around 75 minutes to fully complete; most of 
that is due to optimization after each delta and then the slaves sync up. Our 
index is around 30 gigs so after delta-importing it takes a few minutes to sync 
to each slave and causes a huge increase in disk I/O and thus slowing down the 
machine to an unusable state. To get around this we have a rolling upgrade 
process whereby one slave at a time takes itself offline and then syncs and 
then brings itself back up. Gross… I know. When we want to run a full-import, 
which could take upwards of 30 hours, we run it on a separate solr master while 
the first solr master continues to delta-import. When the staging solr master 
is finally done importing we copy over the index to the main solr master which 
will then sync up with the slaves. This has been working for us but it 
obviously has its flaws.

I've been looking into completely re-writing our architecture to utilize Solr 
Cloud to help us with some of these pain points, if it makes sense. Please let 
me know how Solr 4.0 and Solr Cloud could help. 

I also have the following questions.
Does DIH work with Solr Cloud?
Can Solr Cloud utilize the whole cluster to index in parallel to remove the 
burden of one machine from performing that task? If so, how is it balanced 
across all nodes? Can this work with DIH?
When we decide to run a full-import, how can we do this and not affect our 
existing cluster since there is no real master/slave and obviously no staging 
"master"?

Thanks in advance!

- M

Re: Need help with graphing function (MATH)

2012-02-14 Thread Mark
Thanks, I'll have a look at this. I should have mentioned that the actual 
values on the graph aren't important; rather, I was showing an example of 
how the function should behave.


On 2/13/12 6:25 PM, Kent Fitch wrote:

Hi, assuming you have x and want to generate y, then maybe

- if x < 50, y = 150

- if x > 175, y = 60

- otherwise :

either y = (100/(e^((x -50)/75)^2)) + 50
http://www.wolframalpha.com/input/?i=plot++%28100%2F%28e^%28%28x+-50%29%2F75%29^2%29%29+%2B+50%2C+x%3D50..175


- or maybe y =sin((x+5)/38)*42+105

http://www.wolframalpha.com/input/?i=plot++sin%28%28x%2B5%29%2F38%29*42%2B105%2C+x%3D50..175

Regards,

Kent Fitch

On Tue, Feb 14, 2012 at 12:29 PM, Mark  wrote:


I need some help with one of my boost functions. I would like the
function to look something like the following mockup below. Starts
off flat then there is a gradual decline, steep decline then
gradual decline and then back to flat.

Can some of you math guys please help :)

Thanks.






Re: Need help with graphing function (MATH)

2012-02-14 Thread Mark
Would you mind throwing out an example of these types of functions? 
Looking at Wikipedia (http://en.wikipedia.org/wiki/Probit) it seems 
like the Probit function is very similar to what I want.


Thanks
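
For illustration, a single logistic term already behaves the way the original 
mockup describes (the constants here are only an example, tuned to the earlier 
150-to-60 shape):

y = 60 + 90/(1 + e^((x - 112.5)/20))

This stays flat near 150 for x well below 50, drops fastest around the midpoint 
x = 112.5, and flattens out near 60 above 175; the 20 in the exponent controls 
how steep the middle section is.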

On 2/14/12 10:56 AM, Ted Dunning wrote:

In general this kind of function is very easy to construct using sums of basic 
sigmoidal functions. The logistic and probit functions are commonly used for 
this.

Sent from my iPhone

On Feb 14, 2012, at 10:05, Mark  wrote:


Thanks I'll have a look at this. I should have mentioned that the actual values 
on the graph aren't important rather I was showing an example of how the 
function should behave.

On 2/13/12 6:25 PM, Kent Fitch wrote:

Hi, assuming you have x and want to generate y, then maybe

- if x < 50, y = 150

- if x > 175, y = 60

- otherwise :

either y = (100/(e^((x -50)/75)^2)) + 50
http://www.wolframalpha.com/input/?i=plot++%28100%2F%28e^%28%28x+-50%29%2F75%29^2%29%29+%2B+50%2C+x%3D50..175


- or maybe y =sin((x+5)/38)*42+105

http://www.wolframalpha.com/input/?i=plot++sin%28%28x%2B5%29%2F38%29*42%2B105%2C+x%3D50..175

Regards,

Kent Fitch

On Tue, Feb 14, 2012 at 12:29 PM, Mark  wrote:

I need some help with one of my boost functions. I would like the
function to look something like the following mockup below. Starts
off flat then there is a gradual decline, steep decline then
gradual decline and then back to flat.

Can some of you math guys please help :)

Thanks.






Re: Need help with graphing function (MATH)

2012-02-14 Thread Mark

Or better yet an example in solr would be best :)

Thanks!

On 2/14/12 11:05 AM, Mark wrote:
Would you mind throwing out an example of these types of functions? 
Looking at Wikipedia (http://en.wikipedia.org/wiki/Probit) it seems 
like the Probit function is very similar to what I want.


Thanks

On 2/14/12 10:56 AM, Ted Dunning wrote:
In general this kind of function is very easy to construct using sums 
of basic sigmoidal functions. The logistic and probit functions are 
commonly used for this.


Sent from my iPhone

On Feb 14, 2012, at 10:05, Mark  wrote:

Thanks I'll have a look at this. I should have mentioned that the 
actual values on the graph aren't important rather I was showing an 
example of how the function should behave.


On 2/13/12 6:25 PM, Kent Fitch wrote:

Hi, assuming you have x and want to generate y, then maybe

- if x < 50, y = 150

- if x > 175, y = 60

- otherwise :

either y = (100/(e^((x -50)/75)^2)) + 50
http://www.wolframalpha.com/input/?i=plot++%28100%2F%28e^%28%28x+-50%29%2F75%29^2%29%29+%2B+50%2C+x%3D50..175 




- or maybe y =sin((x+5)/38)*42+105

http://www.wolframalpha.com/input/?i=plot++sin%28%28x%2B5%29%2F38%29*42%2B105%2C+x%3D50..175 



Regards,

Kent Fitch

On Tue, Feb 14, 2012 at 12:29 PM, Mark  wrote:


I need some help with one of my boost functions. I would like the
function to look something like the following mockup below. Starts
off flat then there is a gradual decline, steep decline then
gradual decline and then back to flat.

Can some of you math guys please help :)

Thanks.






Question on replication

2010-11-22 Thread Mark
After I perform a delta-import on my master the slave replicates the 
whole index which can be quite time consuming. Is there any way for the 
slave to replicate only partials that have changed? Do I need to change 
some setting on master not to commit/optimize to get this to work?


Thanks


Solr DataImportHandler (DIH) and Cassandra

2010-11-29 Thread Mark

Is there any way to use DIH to import from Cassandra? Thanks


Re: Solr DataImportHandler (DIH) and Cassandra

2010-11-29 Thread Mark
The DataSource subclass route is what I will probably be interested in. 
Are there any working examples of this already out there?
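
The skeleton such a DataSource sub class has to fill in looks roughly like this 
(a sketch against the DIH DataSource API; the Cassandra/Hector access itself is 
stubbed out and the property names are assumptions):

import java.util.Iterator;
import java.util.Map;
import java.util.Properties;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataSource;

// Sketch of a custom DIH data source: DIH calls init() once per import,
// then getData() for each entity query, then close().
public class CassandraDataSource extends DataSource<Iterator<Map<String, Object>>> {

  private String host;

  @Override
  public void init(Context context, Properties initProps) {
    // connection settings come from the dataSource element in data-config.xml
    host = initProps.getProperty("host", "localhost:9160");
    // open the Hector/Thrift connection here
  }

  @Override
  public Iterator<Map<String, Object>> getData(String query) {
    // run "query" (whatever convention you choose for the entity's query
    // attribute) against Cassandra and return one Map per row, keyed by
    // column name -- this part is the Hector-specific work, stubbed out here
    throw new UnsupportedOperationException("Cassandra fetch not implemented in this sketch");
  }

  @Override
  public void close() {
    // release the connection
  }
}

The class is then referenced from data-config.xml like any other dataSource type.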


On 11/29/10 12:32 PM, Aaron Morton wrote:

AFAIK there is nothing pre-written to pull the data out for you.

You should be able to create your DataSource sub class 
(http://lucene.apache.org/solr/api/org/apache/solr/handler/dataimport/DataSource.html) 
using the Hector Java library to pull data from Cassandra.


I'm guessing you will need to consider how to perform delta imports. 
Perhaps using the secondary indexes in 0.7* , or maintaining your own 
queues or indexes to know what has changed.


There is also the Lucandra project, not exactly what you're after but 
may be of interest anyway: https://github.com/tjake/Lucandra


Hope that helps.
Aaron


On 30 Nov, 2010, at 05:04 AM, Mark  wrote:


Is there any way to use DIH to import from Cassandra? Thanks


Limit number of characters returned

2010-12-02 Thread Mark

Is there a way to limit the number of characters returned from a stored field?

For example:

Say I have a document (~2K words) and I search for a word that's 
somewhere in the middle. I would like the document to match the search 
query but the stored field should only return the first 200 characters 
of the document. Is there any way to accomplish this that doesn't involve 
two fields?


Thanks


Re: Limit number of characters returned

2010-12-03 Thread Mark
Correct me if I am wrong, but I would like to return highlighted excerpts 
from the document, so I would still need to index and store the whole 
document, right (i.e. highlighting only works on stored fields)?


On 12/3/10 3:51 AM, Ahmet Arslan wrote:


--- On Fri, 12/3/10, Mark  wrote:


From: Mark
Subject: Limit number of characters returned
To: solr-user@lucene.apache.org
Date: Friday, December 3, 2010, 5:39 AM
Is there way to limit the number of
characters returned from a stored field?

For example:

Say I have a document (~2K words) and I search for a word
that's somewhere in the middle. I would like the document to
match the search query but the stored field should only
return the first 200 characters of the document. Is there
anyway to accomplish this that doesn't involve two fields?

I don't think it is possible out-of-the-box. Maybe you can hack the highlighter to 
return the first 200 characters in the highlighting response.
Or a custom response writer can do that.

But if you will always be returning the first 200 characters of documents, I think creating an additional field with 
indexed="false" stored="true" will be more efficient. And if you make your original field 
indexed="true" stored="false", your index size will be diminished.







Negative fl param

2010-12-03 Thread Mark
When returning results is there a way I can say to return all fields 
except a certain one?


So say I have stored fields foo, bar and baz but I only want to return 
foo and bar. Is it possible to do this without specifically listing out 
the fields I do want?


Re: Limit number of characters returned

2010-12-03 Thread Mark

Thanks for the response.

Couldn't I just use the highlighter and configure it to use the 
alternative field to return the first 200 characters?  In cases where 
there is a highlighter match I would prefer to show the excerpts anyway.


http://wiki.apache.org/solr/HighlightingParameters#hl.alternateField
http://wiki.apache.org/solr/HighlightingParameters#hl.maxAlternateFieldLength

Is there something wrong with this method?
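
In SolrJ terms that amounts to something like the following (field names are 
illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HighlightWithFallback {
  public static QueryResponse search(SolrServer solr, String userQuery) throws Exception {
    SolrQuery q = new SolrQuery(userQuery);
    q.setHighlight(true);
    q.addHighlightField("body");
    q.set("hl.alternateField", "body");           // fall back to the field itself when nothing matched
    q.set("hl.maxAlternateFieldLength", "200");   // cap the fallback at 200 characters
    return solr.query(q);
  }
}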

On 12/3/10 8:03 AM, Erick Erickson wrote:

Yep, you're correct. CopyField is probably your simplest option here as
Ahmet suggested.
A more complex solution would be your own response writer, but unless and
until you
index gets cumbersome, I'd avoid that. Plus, storing the copied contents
only shouldn't
impact search much, since this doesn't add any terms...

Best
Erick

On Fri, Dec 3, 2010 at 10:32 AM, Mark  wrote:


Correct me if I am wrong but I would like to return highlighted excerpts
from the document so I would still need to index and store the whole
document right (ie.. highlighting only works on stored fields)?


On 12/3/10 3:51 AM, Ahmet Arslan wrote:


--- On Fri, 12/3/10, Mark   wrote:

  From: Mark

Subject: Limit number of characters returned
To: solr-user@lucene.apache.org
Date: Friday, December 3, 2010, 5:39 AM
Is there way to limit the number of
characters returned from a stored field?

For example:

Say I have a document (~2K words) and I search for a word
that's somewhere in the middle. I would like the document to
match the search query but the stored field should only
return the first 200 characters of the document. Is there
anyway to accomplish this that doesn't involve two fields?


I don't think it is possible out-of-the-box. May be you can hack
highlighter to return that first 200 characters in highlighting response.
Or a custom response writer can do that.

But if you will be always returning first 200 characters of documents, I
think creating additional field with indexed="false" stored="true" will be
more efficient. And you can make your original field indexed="true"
stored="false", your index size will be diminished.








Re: Negative fl param

2010-12-03 Thread Mark
OK, simple enough. I just created a SearchComponent that removes values 
from the fl param.
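
In case it is useful to anyone else, a rough sketch of what such a component might look like (the class name and the fl.exclude parameter are made up; it still has to be registered in solrconfig.xml and added to the handler's component list, and depending on the Solr version a couple more SolrInfoMBean methods may need trivial implementations):

import java.io.IOException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class ExcludeFieldComponent extends SearchComponent {
  // Hypothetical request parameter naming the field to drop from fl.
  private static final String EXCLUDE_PARAM = "fl.exclude";

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    String exclude = rb.req.getParams().get(EXCLUDE_PARAM);
    String fl = rb.req.getParams().get(CommonParams.FL);
    if (exclude == null || fl == null) return;
    StringBuilder kept = new StringBuilder();
    for (String f : fl.split(",")) {
      if (f.trim().equals(exclude)) continue;   // skip the excluded field
      if (kept.length() > 0) kept.append(',');
      kept.append(f.trim());
    }
    ModifiableSolrParams params = new ModifiableSolrParams(rb.req.getParams());
    params.set(CommonParams.FL, kept.toString());
    rb.req.setParams(params);
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // nothing to do at process time; all the work happens in prepare()
  }

  @Override
  public String getDescription() {
    return "Removes a named field from the fl parameter";
  }
}

This version only edits an fl that was supplied explicitly; if fl is absent, it would first need the full field list (for example from the Luke request handler, as suggested below) before removing anything.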


On 12/3/10 9:32 AM, Ahmet Arslan wrote:

When returning results is there a way
I can say to return all fields except a certain one?

So say I have stored fields foo, bar and baz but I only
want to return foo and bar. Is it possible to do this
without specifically listing out the fields I do want?


There was a similar discussion: http://search-lucene.com/m/2qJaU1wImo3/

A workaround can be getting all stored field names from 
http://wiki.apache.org/solr/LukeRequestHandler and constructing fl accordingly.
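
Along those lines, a hedged SolrJ sketch of building an fl string from the Luke handler's field list (the core URL and the excluded field name "baz" are placeholders, and it does not check whether a field is actually stored):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.LukeRequest;
import org.apache.solr.client.solrj.response.LukeResponse;

public class BuildFlFromLuke {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/core0").build()) {
      LukeRequest luke = new LukeRequest();
      luke.setNumTerms(0);                               // we only need field names, not term stats
      LukeResponse rsp = luke.process(solr);
      List<String> fields = new ArrayList<>(rsp.getFieldInfo().keySet());
      fields.remove("baz");                              // drop the field we do not want returned
      String fl = String.join(",", fields);
      System.out.println("fl=" + fl);                    // pass this as the fl parameter on queries
    }
  }
}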





Highlighting parameters

2010-12-03 Thread Mark

Is there a way I can specify separate configuration for 2 different fields?

For field 1 I want to display only 100 chars, for field 2 200 chars.
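
One way this is usually handled is Solr's per-field parameter override syntax (f.<fieldname>.hl.*); a small hedged sketch, assuming the two fields really are called field1 and field2:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class PerFieldHighlighting {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/core0").build()) {
      SolrQuery q = new SolrQuery("some query");
      q.setHighlight(true);
      q.set("hl.fl", "field1,field2");
      q.set("f.field1.hl.fragsize", "100"); // ~100-character snippets for field1
      q.set("f.field2.hl.fragsize", "200"); // ~200-character snippets for field2
      System.out.println(solr.query(q).getHighlighting());
    }
  }
}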




Solr Newbie - need a point in the right direction

2010-12-06 Thread Mark
Hi,

First time poster here - I'm not entirely sure where I need to look for this
information.

What I'm trying to do is extract some (presumably) structured information
from non-uniform data (e.g., prices from a Nutch crawl) that needs to show up in
search results, and I've come up against a wall.

I've been unable to figure out where is the best place to begin.

I had a look through the solr wiki and did a search via Lucid's search tool
and I'm guessing this is handled at index time through my schema? But I've
also seen dismax being thrown around as a possible solution and this has
confused me.

Basically, if you guys could point me in the right direction for resources
(even as much as saying, you need X, it's over there) that would be a huge
help.

Cheers

Mark


Re: Solr Newbie - need a point in the right direction

2010-12-07 Thread Mark
Thanks to everyone who responded, no wonder I was getting confused, I was
completely focusing on the wrong half of the equation.

I had a cursory look through some of the Nutch documentation available and
it is looking promising.

Thanks everyone.

Mark

On Tue, Dec 7, 2010 at 10:19 PM, webdev1977  wrote:

>
> In my experience, the hardest (but most flexible) part is exactly what was
> mentioned: processing the data.  Nutch does have a really easy plugin
> interface that you can use, and the example plugin is a great place to
> start.  Once you have the raw parsed text, you can do whatever you want
> with it.  For example, I wrote a plugin to add geospatial information to
> my
> NutchDocument.  You then map the fields you added in the NutchDocument to
> something you want to have Solr index.  In my case I created a geography
> field where I put lat, lon info.  Then you create that same geography field
> in the nutch to solr mapping file as well as your solr schema.xml file.
> Then, when you run the crawl and tell it to use "solrindex" it will send
> the
> document to solr to be indexed.  Since you have your new field in the
> schema, it knows what to do with it at index time.  Now you can build a
> user
> interface around what you want to do with that field.


Warming searchers/Caching

2010-12-07 Thread Mark
Is there any plugin or easy way to auto-warm/cache a new searcher with a 
bunch of searches read from a file? I know this can be accomplished 
using the EventListeners (newSearcher, firstSearcher) but I would rather not 
add 100+ queries to my solrconfig.xml.


If there is no hook/listener available, is there some sort of Handler 
that performs this sort of function? Thanks!


Re: Warming searchers/Caching

2010-12-07 Thread Mark

Maybe I should explain my problem a little more in detail.

The problem we are experiencing is that after a delta-import we notice an 
extremely high load time on the slave machines that just replicated. It 
goes away after a minute or so of production traffic, once everything is cached.


I already have a before/after hook that is in place before/after 
replication takes place. The before hook removes the slave from the 
cluster and then starts to replicate. When it's done it calls the after 
hook and I would like to warm up the cache in this method so no users 
experience extremely long wait times.


On 12/7/10 4:22 PM, Markus Jelsma wrote:

XInclude works fine but that's not what you're looking for, I guess. Having the
100 top queries is overkill anyway and it can take too long for a new searcher
to warm up.

Depending on the type of requests, I usually tend to limit warming to popular
filter queries only, as they generate a very high hit ratio and make caching
useful [1].

If there are very popular user-entered queries with a high initial latency,
I'd have them warmed up as well.

[1]: http://wiki.apache.org/solr/SolrCaching#Tradeoffs


Warning: I haven't used this personally, but Xinclude looks like what
you're after, see: http://wiki.apache.org/solr/SolrConfigXml#XInclude



Best
Erick

On Tue, Dec 7, 2010 at 6:33 PM, Mark  wrote:

Is there any plugin or easy way to auto-warm/cache a new searcher with a
bunch of searches read from a file? I know this can be accomplished using
the EventListeners (newSearcher, firstSearcher) but I rather not add 100+
queries to my solrconfig.xml.

If there is no hook/listener available, is there some sort of Handler
that performs this sort of function? Thanks!


Re: Warming searchers/Caching

2010-12-08 Thread Mark

I am using 1.4.1.

What am I doing that Solr already provides?

Thanks for your help

On 12/8/10 5:10 AM, Erick Erickson wrote:

What version of Solr are you using? Because it seems like you're doing
a lot of stuff that Solr already does for you automatically

So perhaps a more complete statement of your setup is in order, since
we seem to be talking past each other.

Best
Erick

On Tue, Dec 7, 2010 at 10:24 PM, Mark  wrote:


Maybe I should explain my problem a little more in detail.

The problem we are experiencing is after a delta-import we notice a
extremely high load time on the slave machines that just replicated. It goes
away after a min or so production traffic once everything is cached.

I already have a before/after hook that is in place before/after
replication takes place. The before hook removes the slave from the cluster
and then starts to replicate. When its done it calls the after hook and I
would like to warm up the cache in this method so no users experience
extremely long wait times.


On 12/7/10 4:22 PM, Markus Jelsma wrote:


XInclude works fine but that's not what your looking for i guess. Having
the
100 top queries is overkill anyway and it can take too long for a new
searcher
to warmup.

Depending on the type of requests, i usually tend to limit warming to
popular
filter queries only as they generate a very high hit ratio at make caching
useful [1].

If there are very popular user entered queries having a high initial
latency,
i'd have them warmed up as well.

[1]: http://wiki.apache.org/solr/SolrCaching#Tradeoffs

  Warning: I haven't used this personally, but Xinclude looks like what

you're after, see: http://wiki.apache.org/solr/SolrConfigXml#XInclude



Best
Erick

On Tue, Dec 7, 2010 at 6:33 PM, Mark   wrote:


Is there any plugin or easy way to auto-warm/cache a new searcher with a
bunch of searches read from a file? I know this can be accomplished
using
the EventListeners (newSearcher, firstSearcher) but I rather not add
100+
queries to my solrconfig.xml.

If there is no hook/listener available, is there some sort of Handler
that performs this sort of function? Thanks!



Re: Warming searchers/Caching

2010-12-08 Thread Mark
We only replicate twice an hour so we are far from real-time indexing. 
Our application never writes to the master; rather, we just pick up all 
changes using updated_at timestamps when delta-importing with DIH.


We don't have any warming queries in firstSearcher/newSearcher event 
listeners. My initial post was asking how I would go about doing this 
with a large number of queries. Our queries themselves tend to have a 
lot of faceting and other restrictions on them, so I would rather not 
list them all out in XML. I was hoping there was some sort of log 
replayer handler or class that would replay a bunch of queries while the 
node is offline. When it's done, it would bring the node back online ready 
to serve requests.


On 12/8/10 6:15 AM, Jonathan Rochkind wrote:

How often do you replicate? Do you know how long your warming queries take to 
complete?

As others in this thread have mentioned, if your replications (or ordinary 
commits, if you weren't using replication) happen quicker than warming takes to 
complete, you can get overlapping indexes being warmed up, and run out of RAM 
(causing garbage collection to take lots of CPU, if not an out-of-memory 
error), or otherwise block on CPU with lots of new indexes being warmed at once.

Solr is not very good at providing 'real time indexing' for this reason, 
although I believe there are some features in post-1.4 trunk meant to support 
'near real time search' better.
____
From: Mark [static.void@gmail.com]
Sent: Tuesday, December 07, 2010 10:24 PM
To: solr-user@lucene.apache.org
Subject: Re: Warming searchers/Caching

Maybe I should explain my problem a little more in detail.

The problem we are experiencing is after a delta-import we notice a
extremely high load time on the slave machines that just replicated. It
goes away after a min or so production traffic once everything is cached.

I already have a before/after hook that is in place before/after
replication takes place. The before hook removes the slave from the
cluster and then starts to replicate. When its done it calls the after
hook and I would like to warm up the cache in this method so no users
experience extremely long wait times.

On 12/7/10 4:22 PM, Markus Jelsma wrote:

XInclude works fine but that's not what your looking for i guess. Having the
100 top queries is overkill anyway and it can take too long for a new searcher
to warmup.

Depending on the type of requests, i usually tend to limit warming to popular
filter queries only as they generate a very high hit ratio at make caching
useful [1].

If there are very popular user entered queries having a high initial latency,
i'd have them warmed up as well.

[1]: http://wiki.apache.org/solr/SolrCaching#Tradeoffs


Warning: I haven't used this personally, but Xinclude looks like what
you're after, see: http://wiki.apache.org/solr/SolrConfigXml#XInclude



Best
Erick

On Tue, Dec 7, 2010 at 6:33 PM, Mark   wrote:

Is there any plugin or easy way to auto-warm/cache a new searcher with a
bunch of searches read from a file? I know this can be accomplished using
the EventListeners (newSearcher, firstSearcher) but I rather not add 100+
queries to my solrconfig.xml.

If there is no hook/listener available, is there some sort of Handler
that performs this sort of function? Thanks!


Re: Warming searchers/Caching

2010-12-08 Thread Mark
I actually built in the before/after hooks so we can disable/enable a 
node in the cluster while it's replicating. When the machine was 
copying over 20 gigs and serving requests the load spiked tremendously. 
It was easy enough to create a sort of rolling replication, i.e.,


1) node 1 removes its health-check file, replicates, then goes back up
2) node 2 removes its health-check file, replicates, then goes back up,
...

Which listener gets called after replication... I'm guessing newSearcher?

Thanks for your help

On 12/8/10 10:18 AM, Erick Erickson wrote:

Perhaps the tricky part here is that Solr makes its caches for #parts# of
the query. In other words, a query that sorts on field A will populate
the cache for field A. Any other query that sorts on field A will use the
same cache. So you really need just enough queries to populate, in this
case, the fields you'll sort by. One could put together multiple sorts on a
single query and populate the sort caches all at once if you wanted.

Similarly for faceting and filter queries. You might well be able to make
just a few queries that filled up all the relevant caches rather than the
using 100s, but you know your schema way better than I do.

What I meant about replicating work is that trying to use your after
hook to fire off the queries probably doesn't buy you anything
over firstSearcher/newSearcher lists.

All that said, though, if you really don't want to put your queries in
the config file, it would be relatively trivial to write a small Java app
that uses SolrJ to query the server, reading the queries from
anyplace you chose and call it from the after hook. Personally, I
think this is a high-cost option when compared to having the list
in the config file due to the added complexity, but that's your
call.

Best
Erick
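
For what it's worth, a bare-bones version of that SolrJ approach might look like this (it assumes a plain text file with one q string per line, passed on the command line along with the core URL; queries carrying facets or filters would need the lines parsed into full parameter sets instead):

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class WarmFromFile {
  public static void main(String[] args) throws Exception {
    // args[0] = core URL, e.g. http://localhost:8983/solr/core0; args[1] = query file
    try (HttpSolrClient solr = new HttpSolrClient.Builder(args[0]).build()) {
      for (String line : Files.readAllLines(Paths.get(args[1]))) {
        if (line.trim().isEmpty()) continue;
        solr.query(new SolrQuery(line.trim())); // fire the query purely to populate the caches
      }
    }
  }
}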

On Wed, Dec 8, 2010 at 12:25 PM, Mark  wrote:


We only replicate twice an hour so we are far from real-time indexing. Our
application never writes to master rather we just pick up all changes using
updated_at timestamps when delta-importing using DIH.

We don't have any warming queries in firstSearcher/newSearcher event
listeners. My initial post was asking how I would go about doing this with a
large number of queries. Our queries themselves tend to have a lot of
faceting and other restrictions on them so I would rather not list them all
out using xml. I was hoping there was some sort of log replayer handler or
class that would replay a bunch of queries while the node is offline. When
its done, it will bring the node back online ready to serve requests.


On 12/8/10 6:15 AM, Jonathan Rochkind wrote:


How often do you replicate? Do you know how long your warming queries take
to complete?

As others in this thread have mentioned, if your replications (or ordinary
commits, if you weren't using replication) happen quicker than warming takes
to complete, you can get overlapping indexes being warmed up, and run out of
RAM (causing garbage collection to take lots of CPU, if not an out-of-memory
error), or otherwise block on CPU with lots of new indexes being warmed at
once.

Solr is not very good at providing 'real time indexing' for this reason,
although I believe there are some features in post-1.4 trunk meant to
support 'near real time search' better.

From: Mark [static.void@gmail.com]
Sent: Tuesday, December 07, 2010 10:24 PM
To: solr-user@lucene.apache.org
Subject: Re: Warming searchers/Caching

Maybe I should explain my problem a little more in detail.

The problem we are experiencing is after a delta-import we notice a
extremely high load time on the slave machines that just replicated. It
goes away after a min or so production traffic once everything is cached.

I already have a before/after hook that is in place before/after
replication takes place. The before hook removes the slave from the
cluster and then starts to replicate. When its done it calls the after
hook and I would like to warm up the cache in this method so no users
experience extremely long wait times.

On 12/7/10 4:22 PM, Markus Jelsma wrote:


XInclude works fine but that's not what your looking for i guess. Having
the
100 top queries is overkill anyway and it can take too long for a new
searcher
to warmup.

Depending on the type of requests, i usually tend to limit warming to
popular
filter queries only as they generate a very high hit ratio at make
caching
useful [1].

If there are very popular user entered queries having a high initial
latency,
i'd have them warmed up as well.

[1]: http://wiki.apache.org/solr/SolrCaching#Tradeoffs

  Warning: I haven't used this personally, but Xinclude looks like what

you're after, see: http://wiki.apache.org/solr/SolrConfigXml#XInclude



Best
Erick

On Tue, Dec 7, 2010 at 6:33 PM, Mark
wrote:


Is there any plugin or easy way to auto-warm/cache a new searcher with
a
bunch of searches read from a file? I know this ca

Re: Warming searchers/Caching

2010-12-09 Thread Mark
Our machines have around 8 GB of RAM and our index is 25 GB. What are some 
good values for those cache settings? It looks like we have the defaults in 
place...


size="16384"
initialSize="4096"
autowarmCount="1024"


You are correct, I am just removing the health-check file and our 
load balancer prevents any traffic from reaching those nodes while they 
are replicating.


On 12/8/10 4:41 PM, Chris Hostetter wrote:

: What am I doing that Solr already provides?

The one thing I haven't seen mentioned anywhere in this thread is what you
have the "autoWarmCount" value set to on all of the various Solr internal
caches (as seen in your solrconfig.xml).

If that's set, you don't need to manually feed Solr any special queries;
it will warm them automatically when a newSearcher is opened.

This assumes of course that the SolrCore has old caches to warm from --
i.e., if you use normal replication with an existing SolrCore.

You've made references to taking slaves out of clusters using before/after
hooks of your own creation -- as long as this is just stopping traffic from
reaching the slave then auto-warming should work fine for you -- if you
are actually shutting down the SolrCore and starting up a new one, then it
won't -- and you are probably making extra work for yourself.


-Hoss


Very high load after replicating

2010-12-12 Thread Mark
After replicating an index of around 20g my slaves experience very high 
load (50+!!)


Is there anything I can do to alleviate this problem? Would SolrCloud 
be of any help?


thanks


Re: Very high load after replicating

2010-12-13 Thread Mark

Markus,

My configuration is as follows...






...
false
2
...
false
64
10
false
true

No cache warming queries, and our machines have 8 GB of memory in them with 
about 5120 MB of RAM dedicated to Solr. When our index is around 10-11 GB 
in size everything runs smoothly. At around 20 GB+ it just falls apart.


Can you (or anyone) provide some suggestions? Thanks


On 12/12/10 1:11 PM, Markus Jelsma wrote:

There can be numerous explanations such as your configuration (cache warm
queries, merge factor, replication events etc) but also I/O having trouble
flushing everything to disk. It could also be a memory problem, the OS might
start swapping if you allocate too much RAM to the JVM leaving little for the
OS to work with.

You need to provide more details.


After replicating an index of around 20g my slaves experience very high
load (50+!!)

Is there anything I can do to alleviate this problem?  Would solr cloud
be of any help?

thanks


Re: Very high load

2010-12-13 Thread Mark
Changing the subject. It's not related to replication after all; it only 
appeared after indexing an extra field, which increased our index size 
from 12g to 20g+.


On 12/13/10 7:57 AM, Mark wrote:

Markus,

My configuration is as follows...






...
false
2
...
false
64
10
false
true

No cache warming queries and our machines have 8g of memory in them 
with about 5120m of ram dedicated to so Solr. When our index is around 
10-11g in size everything runs smoothly. At around 20g+ it just falls 
apart.


Can you (or anyone) provide some suggestions? Thanks


On 12/12/10 1:11 PM, Markus Jelsma wrote:
There can be numerous explanations such as your configuration (cache 
warm
queries, merge factor, replication events etc) but also I/O having 
trouble
flushing everything to disk. It could also be a memory problem, the 
OS might
start swapping if you allocate too much RAM to the JVM leaving little 
for the

OS to work with.

You need to provide more details.


After replicating an index of around 20g my slaves experience very high
load (50+!!)

Is there anything I can do to alleviate this problem?  Would solr cloud
be of any help?

thanks


Need some guidance on solr-config settings

2010-12-14 Thread Mark
Can anyone offer some advice on what some good settings would be for an 
index of around 6 million documents totaling around 20-25 GB? It seems 
like when our index gets to this size our CPU load spikes tremendously.


What would be some appropriate settings for ramBufferSize and 
mergeFactor? We currently have:


10
64

Same question on cache settings. We currently have:







false
2

Are there any other settings that I could tweak to affect performance?

Thanks


Re: Need some guidance on solr-config settings

2010-12-14 Thread Mark

Excellent reply.

You mentioned: "I've been experimenting with FastLRUCache versus 
LRUCache, because I read that below a certain hit ratio, the latter is 
better."


Do you happen to remember what that threshold is? Thanks

On 12/14/10 7:59 AM, Shawn Heisey wrote:

On 12/14/2010 8:31 AM, Mark wrote:
Can anyone offer some advice on what some good settings would be for 
an index or around 6 million documents totaling around 20-25gb? It 
seems like when our index gets to this size our CPU load spikes 
tremendously.


If you are adding, deleting, or updating documents on a regular basis, 
I would bet that it's your autoWarmCount.  You've told it that 
whenever you do a commit, it needs to make up to 32768 queries against 
the new index.  That's very intense and time-consuming.  If you are 
also optimizing the index, the problem gets even worse.  On the 
documentCache, autowarm doesn't happen, so the 16384 specified there 
isn't actually doing anything.


Below are my settings.  I originally had much larger caches with 
equally large autoWarmCounts ... reducing them to this level was the 
only way I could get my autowarm time below 30 seconds on each index.  
If you go to the admin page for your index and click on Statistics, 
then search for "warmupTime" you'll see how long it took to do the 
queries.  Later on the page you'll also see this broken down on each 
cache.


Since I made the changes, performance is actually better now, not 
worse.  I have been experimenting with FastLRUCache versus LRUCache, 
because I read that below a certain hit ratio, the latter is better.  
I've got 8 million documents in each shard, taking up about 15GB.


My mergeFactor is 16 and my ramBufferSize is 256MB.  These really only 
come into play when I do a full re-index, which is rare.











DIH and UTF-8

2010-12-27 Thread Mark
Seems like I am missing some configuration when trying to use DIH to 
import documents with Chinese characters. All the documents save crazy 
nonsense like "这是测试" instead of the actual Chinese characters.


I think it's at the JDBC level, because if I hardcode one of the fields 
within data-config.xml (using a template transformer) the characters 
show up correctly.


Any ideas? Thanks


Re: DIH and UTF-8

2010-12-27 Thread Mark

Solr: 1.4.1
JDBC driver: Connector/J 5.1.14

Looks like it's the JDBC driver, because it doesn't even work with a 
simple Java program. I know this is a little off subject now, but do you 
have any clues? Thanks again



On 12/27/10 1:58 PM, Erick Erickson wrote:

More data please.

Which jdbc driver? Have you tried just printing out the results of using
that
driver in a simple Java program?

Solr should handle UTF-8 just fine, but the servlet container may have to
have some settings tweaked, which one of those are you using?

What version of Solr?

Best
Erick

On Mon, Dec 27, 2010 at 3:05 PM, Mark  wrote:


Seems like I am missing some configuration when trying to use DIH to import
documents with chinese characters. All the documents save crazy nonsense
like "这是测试" instead of actual chinese characters.

I think its at the JDBC level because if I hardcode one of the fields
within data-config.xml (using a template transformer) the characters show up
correctly.

Any ideas? Thanks



Re: DIH and UTF-8

2010-12-27 Thread Mark

I tried both of those with no such luck.

On 12/27/10 2:49 PM, Glen Newton wrote:

1 - Verify your mysql is set up using UTF-8
2 - Does your JDBC connect string contain:
useUnicode=true&characterEncoding=UTF-8
See: http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-charsets.html

Glen
http://zzzoot.blogspot.com/
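
For reference, a minimal standalone check along the lines of point 2 might look like this (driver URL, credentials, table and column names are all placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcUtf8Check {
  public static void main(String[] args) throws Exception {
    String url = "jdbc:mysql://localhost/mydb?useUnicode=true&characterEncoding=UTF-8";
    try (Connection conn = DriverManager.getConnection(url, "user", "password");
         Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery("SELECT title FROM documents LIMIT 10")) {
      while (rs.next()) {
        // If the driver and server are really talking UTF-8, this should print readable characters.
        System.out.println(rs.getString("title"));
      }
    }
  }
}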

On Mon, Dec 27, 2010 at 5:15 PM, Mark  wrote:

Solr: 1.4.1
JDBC driver: Connector/J 5.1.14

Looks like its the JDBC driver because It doesn't even work with a simple
java program. I know this is a little off subject now, but do you have any
clues? Thanks again


On 12/27/10 1:58 PM, Erick Erickson wrote:

More data please.

Which jdbc driver? Have you tried just printing out the results of using
that
driver in a simple Java program?

Solr should handle UTF-8 just fine, but the servlet container may have to
have some settings tweaked, which one of those are you using?

What version of Solr?

Best
Erick

On Mon, Dec 27, 2010 at 3:05 PM, Markwrote:


Seems like I am missing some configuration when trying to use DIH to
import
documents with chinese characters. All the documents save crazy nonsense
like "这是测试" instead of actual chinese characters.

I think its at the JDBC level because if I hardcode one of the fields
within data-config.xml (using a template transformer) the characters show
up
correctly.

Any ideas? Thanks






Re: DIH and UTF-8

2010-12-27 Thread Mark
Just like the user of that thread... I have my database, table, columns 
and system variables all set, but it still doesn't work as expected.


Server version: 5.0.67 Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> SHOW VARIABLES LIKE 'collation%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | utf8                                   |
| character_set_connection | utf8                                   |
| character_set_database   | utf8                                   |
| character_set_filesystem | binary                                 |
| character_set_results    | utf8                                   |
| character_set_server     | utf8                                   |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.00 sec)


Any other ideas? Thanks


On 12/27/10 3:23 PM, Glen Newton wrote:

[client]
>  default-character-set = utf8
>  [mysql]
>  default-character-set=utf8
>  [mysqld]
>  character_set_server = utf8
>  character_set_client = utf8


Re: DIH and UTF-8

2010-12-28 Thread Mark
It was due to the way I was writing to the DB using our Rails 
application. Everything looked correct, but when retrieving it using the 
JDBC driver it was all mangled.


On 12/27/10 4:38 PM, Glen Newton wrote:

Is it possible your browser is not set up to properly display the
chinese characters? (I am assuming you are looking at things through
your browser)
Do you have any problems viewing other chinese documents properly in
your browser?
Using mysql, can you see these characters properly?

What happens when you use curl or wget to get a document from Solr and
look at it using something besides your browser?

Yes, I am running out of ideas!  :-)

-Glen

On Mon, Dec 27, 2010 at 7:22 PM, Mark  wrote:

Just like the user of that thread... i have my database, table, columns and
system variables all set but it still doesnt work as expected.

Server version: 5.0.67 Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql>  SHOW VARIABLES LIKE 'collation%';
+--+-+
| Variable_name| Value   |
+--+-+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server | utf8_general_ci |
+--+-+
3 rows in set (0.00 sec)

mysql>  SHOW VARIABLES LIKE 'character_set%';
+--++
| Variable_name| Value  |
+--++
| character_set_client | utf8   |
| character_set_connection | utf8   |
| character_set_database   | utf8   |
| character_set_filesystem | binary |
| character_set_results| utf8   |
| character_set_server | utf8   |
| character_set_system | utf8   |
| character_sets_dir   | /usr/local/mysql/share/mysql/charsets/ |
+--++
8 rows in set (0.00 sec)


Any other ideas? Thanks


On 12/27/10 3:23 PM, Glen Newton wrote:

[client]

  default-character-set = utf8
  [mysql]
  default-character-set=utf8
  [mysqld]
  character_set_server = utf8
  character_set_client = utf8





Dynamic column names using DIH

2010-12-28 Thread Mark
Is there a way to create dynamic column names using the values returned 
from the query?


For example:






Re: DIH and UTF-8

2010-12-29 Thread Mark

Sure thing.

In my database.yml I was missing the "encoding: utf8" option.

If one were to add Unicode characters within Rails (console, web form, 
etc.) the characters would appear to be saved correctly, i.e., when trying 
to retrieve them back, everything looked perfect. The characters also 
appeared correctly at the mysql prompt. However, when trying to index 
or retrieve those characters using JDBC/Solr the characters were mangled.


After adding the above utf8 encoding option I was able to correctly save 
utf8 characters into the database and retrieve them using JDBC/Solr. 
However, when using the mysql client all the characters would show up 
mangled or as ''. This was resolved by running the following 
query: "set names utf8;".


On 12/28/10 10:17 PM, Glen Newton wrote:

Hi Mark,

Could you offer a more technical explanation of the Rails problem, so
that if others encounter a similar problem your efforts in finding the
issue will be available to them?  :-)

Thanks,
Glen

PS. This has wandered somewhat off-topic for this list: apologies &
thanks for the patience of this list...

On Tue, Dec 28, 2010 at 4:15 PM, Mark  wrote:

It was due to the way I was writing to the DB using our rails application.
Everythin looked correct but when retrieving it using the JDBC driver it was
all managled.

On 12/27/10 4:38 PM, Glen Newton wrote:

Is it possible your browser is not set up to properly display the
chinese characters? (I am assuming you are looking at things through
your browser)
Do you have any problems viewing other chinese documents properly in
your browser?
Using mysql, can you see these characters properly?

What happens when you use curl or wget to get a document from solr and
looking at it using something besides your browser?

Yes, I am running out of ideas!  :-)

-Glen

On Mon, Dec 27, 2010 at 7:22 PM, Markwrote:

Just like the user of that thread... i have my database, table, columns
and
system variables all set but it still doesnt work as expected.

Server version: 5.0.67 Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql>SHOW VARIABLES LIKE 'collation%';
+--+-+
| Variable_name| Value   |
+--+-+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server | utf8_general_ci |
+--+-+
3 rows in set (0.00 sec)

mysql>SHOW VARIABLES LIKE 'character_set%';
+--++
| Variable_name| Value  |
+--++
| character_set_client | utf8   |
| character_set_connection | utf8   |
| character_set_database   | utf8   |
| character_set_filesystem | binary |
| character_set_results| utf8   |
| character_set_server | utf8   |
| character_set_system | utf8   |
| character_sets_dir   | /usr/local/mysql/share/mysql/charsets/ |
+--++
8 rows in set (0.00 sec)


Any other ideas? Thanks


On 12/27/10 3:23 PM, Glen Newton wrote:

[client]

  default-character-set = utf8
  [mysql]
  default-character-set=utf8
  [mysqld]
  character_set_server = utf8
  character_set_client = utf8







Query multiple cores

2010-12-29 Thread Mark

Is it possible to query across multiple cores and combine the results?

If not available out-of-the-box could this be accomplished using some 
sort of custom request handler?


Thanks for any suggestions.


Re: Query multiple cores

2010-12-29 Thread Mark

I own the book already, Smiley :)

I'm somewhat familiar with this feature, but I wouldn't be searching 
across multiple machines. I would like to search across two separate 
cores on the same machine.


Is distributed search the same as SolrCloud? When would one choose one 
over the other?
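
For two cores on one box, the distributed-search route boils down to the shards parameter; a minimal hedged sketch (core names and URL are made up, and both cores need compatible schemas with a shared uniqueKey):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class TwoCoreSearch {
  public static void main(String[] args) throws Exception {
    // Send the request to one core and let it fan out to both via the shards parameter.
    try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/core0").build()) {
      SolrQuery q = new SolrQuery("some query");
      q.set("shards", "localhost:8983/solr/core0,localhost:8983/solr/core1");
      System.out.println(solr.query(q).getResults().getNumFound());
    }
  }
}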


On 12/29/10 12:34 PM, Smiley, David W. wrote:

I recommend looking for answers on the wiki (or my book) before asking basic 
questions on the list.  Here you go:
http://wiki.apache.org/solr/DistributedSearch

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/

On Dec 29, 2010, at 3:24 PM, Mark wrote:


Is it possible to query across multiple cores and combine the results?

If not available out-of-the-box could this be accomplished using some
sort of custom request handler?

Thanks for any suggestions.


Question on long delta import

2010-12-30 Thread Mark
When using DIH my delta imports appear to finish quickly, i.e., it says 
"Indexing completed. Added/Updated: 95491 documents. Deleted 11148 
documents." in a relatively short amount of time (~30 mins).


However, the importMessage says "A command is still running..." for a 
really long time (~60 mins). What is happening during this phase, and how 
could I speed it up?


Thanks!


DIH MySQLNonTransientConnectionException

2011-01-01 Thread Mark
I have recently been receiving the following errors during my DIH 
imports. Has anyone run into this issue before? Do you know how to resolve it?


Thanks!

Jan 1, 2011 4:51:06 PM org.apache.solr.handler.dataimport.JdbcDataSource 
closeConnection

SEVERE: Ignoring Error when closing connection
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: 
Communications link failure during rollback(). Transaction resolution 
unknown.
at sun.reflect.GeneratedConstructorAccessor29.newInstance(Unknown 
Source)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:407)
at com.mysql.jdbc.Util.getInstance(Util.java:382)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1013)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4751)
at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4345)
at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1564)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:399)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:390)
at 
org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:174)
at 
org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:165)
at 
org.apache.solr.handler.dataimport.DataConfig.clearCaches(DataConfig.java:332)
at 
org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:360)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Jan 1, 2011 4:51:06 PM org.apache.solr.handler.dataimport.JdbcDataSource 
closeConnection

SEVERE: Ignoring Error when closing connection
java.sql.SQLException: Streaming result set 
com.mysql.jdbc.RowDataDynamic@71f18c82 is still active. No statements 
may be issued when any streaming result sets are open and in use on a 
given connection. Ensure that you have called .close() on any active 
streaming result sets before attempting more queries.

at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:934)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:931)
at 
com.mysql.jdbc.MysqlIO.checkForOutstandingStreamingData(MysqlIO.java:2724)

at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1895)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2140)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2620)
at 
com.mysql.jdbc.ConnectionImpl.rollbackNoChecks(ConnectionImpl.java:4854)

at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4737)
at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4345)
at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1564)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:399)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:390)
at 
org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:174)
at 
org.apache.solr.handler.dataimport.DataConfig.clearCaches(DataConfig.java:332)
at 
org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:360)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)




DIH keeps felling during full-import

2011-02-07 Thread Mark
I'm receiving the following exception when trying to perform a 
full-import (~30 hours). Any idea on ways I could fix this?


Is there an easy way to use DIH to break apart a full-import into 
multiple pieces, i.e., 3 mini-imports instead of 1 large import?


Thanks.




Feb 7, 2011 5:52:33 AM org.apache.solr.handler.dataimport.JdbcDataSource 
closeConnection

SEVERE: Ignoring Error when closing connection
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: 
Communications link failure during rollback(). Transaction resolution 
unknown.
at sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown 
Source)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:407)
at com.mysql.jdbc.Util.getInstance(Util.java:382)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1013)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4751)
at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4345)
at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1564)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:399)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:390)
at 
org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:174)
at 
org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:165)
at 
org.apache.solr.handler.dataimport.DataConfig.clearCaches(DataConfig.java:332)
at 
org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:360)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Feb 7, 2011 5:52:33 AM org.apache.solr.handler.dataimport.JdbcDataSource 
closeConnection

SEVERE: Ignoring Error when closing connection
java.sql.SQLException: Streaming result set 
com.mysql.jdbc.RowDataDynamic@1a797305 is still active. No statements 
may be issued when any streaming result sets are open and in use on a 
given connection. Ensure that you have called .close() on any active 
streaming result sets before attempting more queries.

at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:934)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:931)
at 
com.mysql.jdbc.MysqlIO.checkForOutstandingStreamingData(MysqlIO.java:2724)

at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1895)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2140)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2620)
at 
com.mysql.jdbc.ConnectionImpl.rollbackNoChecks(ConnectionImpl.java:4854)

at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4737)
at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4345)
at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1564)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:399)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:390)
at 
org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:174)
at 
org.apache.solr.handler.dataimport.DataConfig.clearCaches(DataConfig.java:332)
at 
org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:360)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Feb 7, 2011 7:03:29 AM org.apache.solr.handler.dataimport.JdbcDataSource 
closeConnection

SEVERE: Ignoring Error when closing connection
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: 
Communications link failure during rollback(). Transaction resolution 
unknown.
at sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown 
Source)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:407)
at com.mysql.jdbc.Util.getInstance(Util.java:382)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1013)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4751)
at com.mysql.jdbc.ConnectionImpl.realCl

Re: DIH keeps failing during full-import

2011-02-07 Thread Mark

Typo in subject

On 2/7/11 7:59 AM, Mark wrote:
I'm receiving the following exception when trying to perform a 
full-import (~30 hours). Any idea on ways I could fix this?


Is there an easy way to use DIH to break apart a full-import into 
multiple pieces? IE 3 mini-imports instead of 1 large import?


Thanks.




Feb 7, 2011 5:52:33 AM 
org.apache.solr.handler.dataimport.JdbcDataSource closeConnection

SEVERE: Ignoring Error when closing connection
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: 
Communications link failure during rollback(). Transaction resolution 
unknown.
at sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown 
Source)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:407)
at com.mysql.jdbc.Util.getInstance(Util.java:382)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1013)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4751)
at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4345)
at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1564)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:399)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:390)
at 
org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:174)
at 
org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:165)
at 
org.apache.solr.handler.dataimport.DataConfig.clearCaches(DataConfig.java:332)
at 
org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:360)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Feb 7, 2011 5:52:33 AM 
org.apache.solr.handler.dataimport.JdbcDataSource closeConnection

SEVERE: Ignoring Error when closing connection
java.sql.SQLException: Streaming result set 
com.mysql.jdbc.RowDataDynamic@1a797305 is still active. No statements 
may be issued when any streaming result sets are open and in use on a 
given connection. Ensure that you have called .close() on any active 
streaming result sets before attempting more queries.

at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:934)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:931)
at 
com.mysql.jdbc.MysqlIO.checkForOutstandingStreamingData(MysqlIO.java:2724)

at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1895)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2140)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2620)
at 
com.mysql.jdbc.ConnectionImpl.rollbackNoChecks(ConnectionImpl.java:4854)

at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4737)
at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4345)
at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1564)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:399)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:390)
at 
org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:174)
at 
org.apache.solr.handler.dataimport.DataConfig.clearCaches(DataConfig.java:332)
at 
org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:360)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Feb 7, 2011 7:03:29 AM 
org.apache.solr.handler.dataimport.JdbcDataSource closeConnection

SEVERE: Ignoring Error when closing connection
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: 
Communications link failure during rollback(). Transaction resolution 
unknown.
at sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown 
Source)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:407)
at com.mysql.jdbc.Util.getInstance(Util.java:382)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1013)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl

Re: DIH keeps felling during full-import

2011-02-07 Thread Mark
Full import is around 6M documents, which when completed totals around 
30GB in size.


I'm guessing it could be a database connectivity problem, because I also 
see these types of errors on delta-imports, which could be anywhere from 
20K to 300K records.


On 2/7/11 8:15 AM, Gora Mohanty wrote:

On Mon, Feb 7, 2011 at 9:29 PM, Mark  wrote:

I'm receiving the following exception when trying to perform a full-import
(~30 hours). Any idea on ways I could fix this?

Is there an easy way to use DIH to break apart a full-import into multiple
pieces? IE 3 mini-imports instead of 1 large import?

Thanks.




Feb 7, 2011 5:52:33 AM org.apache.solr.handler.dataimport.JdbcDataSource
closeConnection
SEVERE: Ignoring Error when closing connection
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException:
Communications link failure during rollback(). Transaction resolution
unknown.

[...]

This looks like a network issue, or some other failure in communicating
with the mysql database. Is that a possibility? Also, how many records
are you importing, what is the data size, what is the quality of the network
connection, etc.?

One way to break up the number of records imported at a time is to
shard your data at the database level, but the advisability of this 
option depends on whether there is a more fundamental issue.

Regards,
Gora


DIH threads

2011-02-18 Thread Mark
Has anyone applied the DIH threads patch on 1.4.1 
(https://issues.apache.org/jira/browse/SOLR-1352)?


Does anyone know if this works and/or does it improve performance?

Thanks




Removing duplicates

2011-02-18 Thread Mark
I know that I can use the SignatureUpdateProcessorFactory to remove 
duplicates, but I would like to keep the duplicates in the index and remove them 
conditionally at query time.


Is there any easy way I could accomplish this?
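
In later Solr releases, result grouping (field collapsing) is the usual answer here. A hedged sketch, assuming an indexed "signature" field populated at index time; leaving the group parameters off returns the duplicates as before:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class CollapseDuplicates {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/core0").build()) {
      SolrQuery q = new SolrQuery("some query");
      // Collapse documents that share the same signature, keeping one per group.
      q.set("group", "true");
      q.set("group.field", "signature");
      q.set("group.main", "true"); // flatten the grouped result back into an ordinary doc list
      System.out.println(solr.query(q).getResults());
    }
  }
}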


Field Collapsing on 1.4.1

2011-02-19 Thread Mark

Is there a seamless field collapsing patch for 1.4.1?

I see it has been merged into trunk, but I tried downloading it to give 
it a whirl and it appears that many things have changed and our 
application would need considerable work to get it up and running.


Thanks

