indexing java byte code in classes / jars
I'm looking to use Solr to search over the bytecode in classes and JARs.

Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task?

Regards

Mark
Re: indexing java byte code in classes / jars
To answer why bytecode: mostly, the use case I have is to index as much detail as possible from jars/classes:

extract class names,
method names
signatures
packages / imports

I am considering using ASM in order to generate an analysis view of the class.

The sort of use cases I have would be method / signature searches. For example:

1) show any classes with a method named parse*

2) show any classes with a method named parse that takes a type *json*

...etc

In the past I have written something to reverse out javadocs from just java bytecode; using Solr would make this idea considerably more powerful.

Thanks for the suggestions so far

On 8 May 2015 at 21:19, Erik Hatcher wrote:

> Oh, and sorry, I omitted a couple of details:
>
> # creating the “java” core/collection
> bin/solr create -c java
>
> # I ran this from my Solr source code checkout, so that
> SolrLogFormatter.class just happened to be handy
>
> Erik
>
> > On May 8, 2015, at 4:11 PM, Erik Hatcher wrote:
> >
> > What kinds of searches do you want to run? Are you trying to extract
> > class names, method names, and such and make those searchable? If that’s
> > the case, you need some kind of “parser” to reverse engineer that
> > information from .class and .jar files before feeding it to Solr, which
> > would happen before analysis. Java itself comes with a javap command that
> > can do this; whether this is the “best” way to go for your scenario I don’t
> > know, but here’s an interesting example pasted below (using Solr 5.x).
> >
> > —
> > Erik Hatcher, Senior Solutions Architect
> > http://www.lucidworks.com
> >
> > javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class > test.txt
> > bin/post -c java test.txt
> >
> > now search for "coreInfoMap":
> > http://localhost:8983/solr/java/browse?q=coreInfoMap
> >
> > I tried to be cleverer and use the stdin option of bin/post, like this:
> > javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class | bin/post -c java -url http://localhost:8983/solr/java/update/extract -type text/plain -params "literal.id=SolrLogFormatter" -out yes -d
> > but something isn’t working right with the stdin detection like that
> > (it does work to `cat test.txt | bin/post…` though, hmmm)
> >
> > test.txt looks like this, `cat test.txt`:
> > Compiled from "SolrLogFormatter.java"
> > public class org.apache.solr.SolrLogFormatter extends java.util.logging.Formatter {
> >   long startTime;
> >   long lastTime;
> >   java.util.Map java.lang.String> methodAlias;
> >   public boolean shorterFormat;
> >   java.util.Map org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap;
> >   public java.util.Map classAliases;
> >   static java.lang.ThreadLocal threadLocal;
> >   public org.apache.solr.SolrLogFormatter();
> >   public void setShorterFormat();
> >   public java.lang.String format(java.util.logging.LogRecord);
> >   public void appendThread(java.lang.StringBuilder, java.util.logging.LogRecord);
> >   public java.lang.String _format(java.util.logging.LogRecord);
> >   public java.lang.String getHead(java.util.logging.Handler);
> >   public java.lang.String getTail(java.util.logging.Handler);
> >   public java.lang.String formatMessage(java.util.logging.LogRecord);
> >   public static void main(java.lang.String[]) throws java.lang.Exception;
> >   public static void go() throws java.lang.Exception;
> >   static {};
> > }
> >
> >> On May 8, 2015, at 3:31 PM, Mark wrote:
> >>
> >> I'm looking to use Solr to search over the bytecode in classes and JARs.
> >> [...]
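As a rough illustration of the ASM approach mentioned above, the untested sketch below reads a single .class file and prints the class name plus its method names and descriptors, which could then be mapped to Solr fields. It assumes the ASM 5 library is on the classpath; the class name ClassInfoExtractor and the output format are made up for this example, not taken from the thread.

import java.io.FileInputStream;
import java.io.IOException;

import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class ClassInfoExtractor {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream(args[0])) {
            ClassReader reader = new ClassReader(in);
            reader.accept(new ClassVisitor(Opcodes.ASM5) {
                @Override
                public void visit(int version, int access, String name, String signature,
                                  String superName, String[] interfaces) {
                    // Internal names use '/', e.g. org/apache/solr/SolrLogFormatter
                    System.out.println("class: " + name.replace('/', '.'));
                }

                @Override
                public MethodVisitor visitMethod(int access, String name, String desc,
                                                 String signature, String[] exceptions) {
                    // desc holds the JVM signature, e.g. (Ljava/util/logging/LogRecord;)Ljava/lang/String;
                    System.out.println("method: " + name + " " + desc);
                    return null; // method bodies are not needed
                }
            }, ClassReader.SKIP_CODE);
        }
    }
}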
Re: indexing java byte code in classes / jars
https://searchcode.com/ looks really interesting; however, I want to crunch as many searchable aspects as possible out of jars sitting on a classpath or under a project structure...

Really early days, so I'm open to any suggestions.

On 8 May 2015 at 22:09, Mark wrote:

> To answer why bytecode: mostly, the use case I have is to index as much
> detail as possible from jars/classes: extract class names, method names,
> signatures, packages / imports.
> [...]
Re: indexing java byte code in classes / jars
Erik,

Thanks for the pretty much OOTB approach. I think I'm going to just try a range of approaches and see how far I get. The "IDE does this" suggestion would be worth looking into as well.

On 8 May 2015 at 22:14, Mark wrote:

> https://searchcode.com/ looks really interesting; however, I want to crunch
> as many searchable aspects as possible out of jars sitting on a classpath
> or under a project structure...
> [...]
Re: indexing java byte code in classes / jars
Hi Alexandre,

Solr & ASM is the exact problem I'm looking to hack about with, so I'm keen to see any code, no matter how ugly or broken.

Regards

Mark

On 9 May 2015 at 10:21, Alexandre Rafalovitch wrote:

> If you only have classes/jars, use ASM. I have done this before, have some
> ugly code to share if you want.
>
> If you have sources, javadoc 8 is a good way too. I am doing that now for
> solr-start.com, code on Github.
>
> Regards,
>    Alex
>
> On 9 May 2015 7:09 am, "Mark" wrote:
> > To answer why bytecode: mostly, the use case I have is to index as much
> > detail as possible from jars/classes: extract class names, method names,
> > signatures, packages / imports.
> > [...]
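Once class and method names have been pulled out (whether via ASM or javap), getting them into Solr is a small SolrJ job. The sketch below is untested and only illustrative: it uses the older HttpSolrServer client (renamed HttpSolrClient from Solr 5.0 onwards) and assumes schema fields named class_name, package and method_name (multi-valued), none of which come from the thread.

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexClassInfo {
    public static void main(String[] args) throws IOException, SolrServerException {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/java");

        // One document per class; method_name is assumed to be multi-valued.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "org.apache.solr.SolrLogFormatter");
        doc.addField("package", "org.apache.solr");
        doc.addField("class_name", "SolrLogFormatter");
        doc.addField("method_name", "format");
        doc.addField("method_name", "formatMessage");

        solr.add(doc);
        solr.commit();
        solr.shutdown();

        // Queries such as method_name:parse* then cover the
        // "show any classes with a method named parse*" use case.
    }
}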
Configuring number of shards
Can you configure the number of shards per collection, or is this a system-wide setting affecting all collections/indexes? Thanks
Sharding and replicas (Solr Cloud)
If I create my collection via the ZkCLI (https://cwiki.apache.org/confluence/display/solr/Command+Line+Utilities), how do I configure the number of shards and replicas? Thanks
SimplePostTool with extracted Outlook messages
I'm looking to index some extracted Outlook messages (*.msg).

I notice .msg isn't one of the default file types, so I tried the following:

java -classpath dist/solr-core-4.10.3.jar -Dtype=application/vnd.ms-outlook org.apache.solr.util.SimplePostTool C:/temp/samplemsg/*.msg

That didn't work. However, curl did:

curl "http://localhost:8983/solr/update/extract?commit=true&overwrite=true&literal.id=6252671B765A1748992DF1A6403BDF81A4A15E00" -F "myfile=@6252671B765A1748992DF1A6403BDF81A4A15E00.msg"

My question is: why does the second work and not the first?
Re: SimplePostTool with extracted Outlook messages
A little further: this fails

java -classpath dist/solr-core-4.10.3.jar -Dtype=application/vnd.ms-outlook org.apache.solr.util.SimplePostTool C:/temp/samplemsg/*.msg

with:

SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 415 for URL: http://localhost:8983/solr/update
POSTing file 6252671B765A1748992DF1A6403BDF81A4A22C00.msg
SimplePostTool: WARNING: Solr returned an error #415 (Unsupported Media Type) for url: http://localhost:8983/solr/update
SimplePostTool: WARNING: Response: 415 Unsupported ContentType: application/vnd.ms-outlook  Not in: [application/xml, text/csv, text/json, application/csv, application/javabin, text/xml, application/json]

However, just calling the extract handler works:

curl "http://localhost:8983/solr/update/extract?extractOnly=true" -F "myfile=@6252671B765A1748992DF1A6403BDF81A4A22C00.msg"

Regards

Mark

On 26 January 2015 at 21:47, Alexandre Rafalovitch wrote:

> Seems like an apples to oranges comparison here.
>
> I would try giving an explicit end point (.../extract), a single
> message, and a literal id for the SimplePostTool and seeing whether
> that works. Not providing an ID could definitely be an issue.
>
> I would also specifically look on the server side in the logs and see
> what the messages say to understand the discrepancies. Solr 5 is a bit
> more verbose about what's going on under the covers, but that's not
> available yet.
>
> Regards,
>    Alex.
>
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
> On 26 January 2015 at 16:34, Mark wrote:
> > I'm looking to index some extracted Outlook messages (*.msg).
> > [...]
Re: SimplePostTool with extracted Outlook messages
Fantastic - that explains it.

Adding -Durl="http://localhost:8983/solr/update/extract?commit=true&overwrite=true"

gets me a little further:

POSTing file 6252671B765A1748992DF1A6403BDF81A4A22E00.msg
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: http://localhost:8983/solr/update/extract?commit=true&overwrite=true
SimplePostTool: WARNING: Response: 400 Document is missing mandatory uniqueKey field: id

However, it's not much use when recursing a directory, as the URL essentially has to change to pass each document ID.

I think I may just extend SimplePostTool, or look to use Solr Cell perhaps?

On 26 January 2015 at 22:14, Alexandre Rafalovitch wrote:

> Well, you are NOT posting to the same URL.
>
> > On 26 January 2015 at 17:00, Mark wrote:
> > http://localhost:8983/solr/update
>
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
Re: SimplePostTool with extracted Outlook messages
Thanks Erik

However

java -classpath dist/solr-core-4.10.3.jar -Dauto=true org.apache.solr.util.SimplePostTool C:/temp/samplemsg/*.msg

fails with:

Posting files to base url http://localhost:8983/solr/update..
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
SimplePostTool: WARNING: Skipping 6252671B765A1748992DF1A6403BDF81A4A02A00.msg. Unsupported file type for auto mode.
SimplePostTool: WARNING: Skipping 6252671B765A1748992DF1A6403BDF81A4A02B00.msg. Unsupported file type for auto mode.
SimplePostTool: WARNING: Skipping 6252671B765A1748992DF1A6403BDF81A4A02C00.msg. Unsupported file type for auto mode.

That's where I started looking into extending it, or adding support for additional types.

Looking into the code as it stands, passing your own URL as well as asking it to recurse a folder means that it requires an ID strategy - which I believe is lacking.

Regards

Mark

On 27 January 2015 at 10:57, Erik Hatcher wrote:

> Try adding -Dauto=true and take away setting url. The type probably isn't
> needed then either.
>
> With the new Solr 5 bin/post it sets auto=true implicitly.
>
> Erik
>
> > On Jan 26, 2015, at 17:29, Mark wrote:
> >
> > Fantastic - that explains it.
> > [...]
Re: SimplePostTool with extracted Outlook messages
Hi Alex,

On an individual file basis that would work, since you could set the ID on an individual basis.

However, when recursing a folder it doesn't work, and worse still the server complains, unless on the server side you use the UpdateRequestProcessor chains with the UUID generator as you suggested.

Thanks for everyone's suggestions.

Regards

Mark

On 27 January 2015 at 18:01, Alexandre Rafalovitch wrote:

> Your IDs seem to be the file names, which you are probably also getting
> from your parsing the file. Can't you just set (or copyField) that as an ID
> on the Solr side?
>
> Alternatively, if you don't actually have good IDs, you could look into
> UpdateRequestProcessor chains with UUID generator.
>
> Regards,
>    Alex.
>
> On 27/01/2015 12:24 pm, "Mark" wrote:
> > Thanks Erik
> > [...]
Re: SimplePostTool with extracted Outlook messages
In the end I didn't find a way to add a new file/MIME type for recursing a folder, so I added msg to the static string and the MIME map:

private static final String DEFAULT_FILE_TYPES = "xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log,msg";

mimeMap.put("msg", "application/vnd.ms-outlook");

Regards

Mark

On 27 January 2015 at 18:39, Mark wrote:

> Hi Alex,
>
> On an individual file basis that would work, since you could set the ID on
> an individual basis.
> [...]
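For anyone hitting the same wall, an alternative to patching SimplePostTool is to walk the folder yourself with SolrJ and send each .msg through /update/extract, deriving literal.id from the file name. The sketch below is untested and only illustrative; it uses the Solr 4.x HttpSolrServer and ContentStreamUpdateRequest classes, and the base URL, folder path and ID scheme are assumptions rather than anything from the thread.

import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class PostMsgFolder {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        for (File f : new File("C:/temp/samplemsg").listFiles()) {
            if (!f.getName().toLowerCase().endsWith(".msg")) {
                continue;
            }
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(f, "application/vnd.ms-outlook");
            // Derive the unique key from the file name (minus the extension).
            req.setParam("literal.id", f.getName().replace(".msg", ""));
            solr.request(req);
        }
        solr.commit();
        solr.shutdown();
    }
}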
extract and add fields on the fly
Is it possible to use curl to upload a document (for extract & indexing) and specify some fields on the fly?

Sort of:

1) index this document
2) by the way, here are some important facets whilst you're at it

Regards

Mark
Re: extract and add fields on the fly
"Create the SID from the existing doc" implies that a document already exists that you wish to add fields to.

However, if the document is a binary, are you suggesting:

1) curl to upload/extract, passing a docID
2) obtain a SID based off the docID
3) add additional fields to the SID & commit

I know I'm possibly wandering into schemaless territory here as well.

On 28 January 2015 at 17:11, Andrew Pawloski wrote:

> I would switch the order of those. Add the new fields and *then* index to
> solr.
>
> We do something similar when we create SolrInputDocuments that are pushed
> to solr. Create the SID from the existing doc, add any additional fields,
> then add to solr.
>
> On Wed, Jan 28, 2015 at 11:56 AM, Mark wrote:
> > Is it possible to use curl to upload a document (for extract & indexing)
> > and specify some fields on the fly?
> > [...]
Re: extract and add fields on the fly
Second thoughts: a SID is purely input, as its name suggests :)

I think a better approach would be:

1) curl to upload/extract, passing a docID
2) curl to update additional fields for that docID

On 28 January 2015 at 17:30, Mark wrote:

> "Create the SID from the existing doc" implies that a document already
> exists that you wish to add fields to.
> [...]
Re: extract and add fields on the fly
I'm looking to:

1) upload a binary document using curl
2) add some additional facets

Specifically, my question is: can this be achieved in one curl operation, or does it need two?

On 28 January 2015 at 17:43, Mark wrote:

> Second thoughts: a SID is purely input, as its name suggests :)
>
> I think a better approach would be:
>
> 1) curl to upload/extract, passing a docID
> 2) curl to update additional fields for that docID
> [...]
Re: extract and add fields on the fly
The use case is: use curl to upload/extract/index a document, passing in additional facets that are not present in the document, e.g. literal.source="old system".

In this way some fields come from the uploaded, extracted content and some fields are specified in the curl URL.

Hope that's clearer?

Regards

Mark

On 28 January 2015 at 17:54, Alexandre Rafalovitch wrote:

> Sounds like 'literal.X' syntax from
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>
> Can you explain your use case as different from what's already
> documented? May be easier to understand.
>
> Regards,
>    Alex.
>
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
> On 28 January 2015 at 12:45, Mark wrote:
> > I'm looking to:
> >
> > 1) upload a binary document using curl
> > 2) add some additional facets
> > [...]
Re: extract and add fields on the fly
That approach works, although as suspected the schema has to recognise the additional facet ("stuff" in this case):

"responseHeader":{"status":400,"QTime":1},"error":{"msg":"ERROR: [doc=6252671B765A1748992DF1A6403BDF81A4A15E00] unknown field 'stuff'","code":400}}

..getting closer..

On 28 January 2015 at 18:03, Mark wrote:

> The use case is: use curl to upload/extract/index a document, passing in
> additional facets that are not present in the document, e.g.
> literal.source="old system".
> [...]
Re: extract and add fields on the fly
Thanks Alexandre,

I figured it out with this example, https://wiki.apache.org/solr/ExtractingRequestHandler, whereby you can add additional fields at upload/extract time:

curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3&literal.blah_s=Bah" -F "tutorial=@help.pdf"

and therefore I learned that you can't set a field that isn't in the schema, which is what I was trying to do before.

Regards

Mark

On 28 January 2015 at 18:38, Alexandre Rafalovitch wrote:

> Well, the schema does need to know what type your field is. If you
> can't add it to schema, use dynamicFields with prefixes/suffixes or
> dynamic schema (less recommended).
>
> Regards,
>    Alex.
>
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
> On 28 January 2015 at 13:32, Mark wrote:
> > That approach works, although as suspected the schema has to recognise
> > the additional facet ("stuff" in this case).
> > [...]
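The same upload-with-literal-fields call can be made from SolrJ, and pointing the extra literals at dynamicField suffixes (for example *_s in the stock schema) avoids the "unknown field" error seen above without a schema change. Untested sketch against the Solr 4.x-era API; the file name, id and field names are assumptions.

import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractWithLiterals {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("help.pdf"), "application/pdf");
        req.setParam("literal.id", "doc4");
        // Extra facet supplied at index time; the _s suffix maps to a dynamic string field.
        req.setParam("literal.source_s", "old system");
        req.setParam("commit", "true");
        solr.request(req);
        solr.shutdown();
    }
}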
Duplicate documents based on attribute
How would I go about doing something like this? Not sure if this is something that can be accomplished on the index side or if it's something that should be done in our application.

Say we are an online store for shoes and we are selling Product A in red, blue and green. Is there a way that, when we search for Product A, all three results can be returned even though they are logically the same item (the same product in our database)?

Thoughts on how this can be accomplished?

Thanks

- M
Re: Duplicate documents based on attribute
I was hoping to do this from within Solr; that way I don't have to manually mess around with pagination. The number of items on each page would be nondeterministic.

On Jul 25, 2013, at 9:48 AM, Anshum Gupta wrote:

> Have a multivalued stored 'color' field and just iterate on it outside of
> solr.
>
> On Thu, Jul 25, 2013 at 10:12 PM, Mark wrote:
> > How would I go about doing something like this? Not sure if this is
> > something that can be accomplished on the index side or if it's something
> > that should be done in our application.
> > [...]
>
> --
> Anshum Gupta
> http://www.anshumgupta.net
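One common pattern for this, sketched below with SolrJ, is to index one document per sellable variant (colour) carrying a shared product id, so a search for "Product A" naturally returns all three; result grouping on that id can collapse them again when needed. This is untested and illustrative only; the field names, ids and core URL are assumptions.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexVariants {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        for (String color : new String[] {"red", "blue", "green"}) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "productA-" + color);  // unique per variant
            doc.addField("product_id", "productA");   // shared across variants
            doc.addField("name", "Product A");
            doc.addField("color", color);
            solr.add(doc);
        }
        solr.commit();
        solr.shutdown();

        // A search for "Product A" now returns all three variants; adding
        // group=true&group.field=product_id collapses them again if needed.
    }
}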
Alternative searches
Can someone explain how one would go about providing alternative searches for a query… similar to Amazon?

For example, say I search for "Red Dump Truck":

- 0 results for "Red Dump Truck"
- 500 results for "Red Truck"
- 350 results for "Dump Truck"

Does this require multiple searches?

Thanks
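There is no single out-of-the-box feature for this, so one workable approach is simply to run the full query first and, when it comes back empty, re-run relaxed variants (term subsets) and report their counts, which does mean multiple searches per request. The SolrJ sketch below is untested; the field name, queries and base URL are assumptions.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class AlternativeSearches {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        long full = count(solr, "title:(Red AND Dump AND Truck)");
        System.out.println("Red Dump Truck: " + full);
        if (full == 0) {
            // Fall back to relaxed variants and report their counts, Amazon-style.
            System.out.println("Red Truck: " + count(solr, "title:(Red AND Truck)"));
            System.out.println("Dump Truck: " + count(solr, "title:(Dump AND Truck)"));
        }
        solr.shutdown();
    }

    private static long count(HttpSolrServer solr, String q) throws Exception {
        SolrQuery query = new SolrQuery(q);
        query.setRows(0); // only the count is needed
        return solr.query(query).getResults().getNumFound();
    }
}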
Percolate feature?
We have a set number of known terms we want to match against.

In Index:
"term one"
"term two"
"term three"

I know how to match all terms of a user query against the index, but we would like to know how/if we can match a user's query against all the terms in the index.

Search Queries:
"my search term" => 0 matches
"my term search one" => 1 match ("term one")
"some prefix term two" => 1 match ("term two")
"one two three" => 0 matches

I can only explain this as almost a reverse search???

I came across the following from ElasticSearch (http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds like this may accomplish the above, but I haven't tested it. I was wondering if Solr had something similar or an alternative way of accomplishing this?

Thanks
Re: Problems matching delimited field
That was it… thanks

On Aug 2, 2013, at 3:27 PM, Shawn Heisey wrote:

> On 8/2/2013 4:16 PM, Robert Zotter wrote:
> > The problem is the query gets expanded to "1 Foo", not ("1" OR "Foo"):
> >
> > 1Foo
> > 1Foo
> > +DisjunctionMaxQuery((name_textsv:"1 foo")) ()
> > +(name_textsv:"1 foo") ()
> >
> > DisMaxQParser
>
> This looks like you have autoGeneratePhraseQueries turned on for the field
> definition in your schema, either explicitly or by having a "version"
> attribute in schema.xml of 1.3 or lower. The current schema version is 1.5.
>
> Thanks,
> Shawn
Re: Percolate feature?
> "can match a user's query against all the terms in the index" - that's
> exactly what Lucene and Solr have done since Day One, for all queries.
> Percolate actually does the opposite - matches an input document against a
> registered set of queries - and doesn't match against indexed documents.
>
> Solr does support Lucene's "min should match" feature so that you can
> specify, say, four query terms and return if at least two match. This is
> the "mm" parameter.

I don't think you understand me.

Say I only have one document indexed and its contents are "Foo Bar". I want this document returned if and only if the query has the words "Foo" and "Bar" in it. If I use an mm of 100% for "Foo Bar Bazz", this document will not be returned because the full user query didn't match. If I use a 0% mm and search "Foo Baz", the document will be returned even though it shouldn't be.

On Aug 2, 2013, at 5:09 PM, Jack Krupansky wrote:

> You seem to be mixing a couple of different concepts here. "Prospective
> search" or reverse search (sometimes called alerts) is a logistics matter,
> but how to match terms is completely different.
>
> Solr does not have the exact "percolate" feature of ES, but your examples
> don't indicate a need for what percolate actually does.
>
> "can match a user's query against all the terms in the index" - that's
> exactly what Lucene and Solr have done since Day One, for all queries.
> Percolate actually does the opposite - matches an input document against a
> registered set of queries - and doesn't match against indexed documents.
>
> Solr does support Lucene's "min should match" feature so that you can
> specify, say, four query terms and return if at least two match. This is
> the "mm" parameter.
>
> See:
> http://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29
>
> Try to clarify your requirements... or maybe min-should-match was all you
> needed?
>
> -- Jack Krupansky
>
> -----Original Message----- From: Mark
> Sent: Friday, August 02, 2013 7:50 PM
> To: solr-user@lucene.apache.org
> Subject: Percolate feature?
>
> We have a set number of known terms we want to match against.
> [...]
Re: Percolate feature?
Still not understanding. How do I know which words to require while searching? I want to search across all documents and return the ones that have all of their terms matched.

> > I came across the following from ElasticSearch
> > (http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds
> > like this may accomplish the above, but I haven't tested it. I was wondering
> > if Solr had something similar or an alternative way of accomplishing this?

Also, I never said this was Percolate, just that it looked similar.

On Aug 5, 2013, at 11:43 AM, "Jack Krupansky" wrote:

> Fine, then write the query that way: +foo +bar baz
>
> But it still doesn't sound as if any of this relates to prospective
> search/percolate.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Mark
> Sent: Monday, August 05, 2013 2:11 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Percolate feature?
> [...]
Re: Percolate feature?
Ok, forget the mention of percolate.

We have a large list of known keywords we would like to match against.

Product keyword: "Sony"
Product keyword: "Samsung Galaxy"

We would like to be able to detect, given a product title, whether or not it matches any known keywords. For a keyword to be matched, all of its terms must be present in the product title given.

Product Title: "Sony Experia"
Matches and returns a highlight: "Sony Experia"

Product Title: "Samsung 52inch LC"
Does not match

Product Title: "Samsung Galaxy S4"
Matches and returns a highlight: "Samsung Galaxy"

Product Title: "Galaxy Samsung S4"
Matches and returns a highlight: "Galaxy Samsung"

What would be the best way to approach this?

On Aug 5, 2013, at 7:02 PM, Chris Hostetter wrote:

> : Subject: Percolate feature?
>
> can you give a more concrete, realistic example of what you are trying to
> do? your synthetic hypothetical example is kind of hard to make sense of.
>
> your Subject line and comment that the "percolate" feature of elastic
> search sounds like what you want seems to have led some people down a
> path of assuming you want to run these types of queries as documents are
> indexed -- but that isn't at all clear to me from the way you worded your
> question other than that.
>
> it's also not clear what aspect of the "results" you really care about --
> are you only looking for the *number* of documents that "match" according
> to your concept of matching, or are you looking for a list of matches?
> when multiple documents have all of their terms in the query string -- how
> should they score relative to each other? what if a document contains the
> same term multiple times, do you expect it to be a match of a query only
> if that term appears in the query multiple times as well? do you care
> about the ordering of the terms in the query? the ordering of the terms in
> the document?
>
> Ideally: describe for us what you want to do, w/o assuming
> solr/elasticsearch/anything specific about the implementation -- just
> describe your actual use case for us, with several real document/query
> examples.
>
> https://people.apache.org/~hossman/#xyproblem
> XY Problem
>
> Your question appears to be an "XY Problem" ... that is: you are dealing
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue. Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
> -Hoss
Re: Percolate feature?
> *All* of the terms in the field must be matched by the query, not
> vice-versa.

Exactly. This is why I was trying to explain it as a reverse search.

I just realized I described it as a *large* list of known keywords when really it's small; no more than 1000. Forgetting about performance, how hard do you think this would be to implement? How should I even start?

Thanks for the input

On Aug 9, 2013, at 6:56 AM, Yonik Seeley wrote:

> *All* of the terms in the field must be matched by the query, not
> vice-versa.
> And no, we don't have a query for that out of the box. To implement,
> it seems like it would require the total number of terms indexed for a
> field (for each document).
> I guess you could also index start and end tokens and then use query
> expansion to all possible combinations... messy though.
>
> -Yonik
> http://lucidworks.com
>
> On Fri, Aug 9, 2013 at 8:19 AM, Erick Erickson wrote:
> > This _looks_ like simple phrase matching (no slop) and highlighting...
> >
> > But whenever I think the answer is really simple, it usually means
> > that I'm missing something
> >
> > Best
> > Erick
> >
> > On Thu, Aug 8, 2013 at 11:18 PM, Mark wrote:
> > > Ok, forget the mention of percolate.
> > >
> > > We have a large list of known keywords we would like to match against.
> > > [...]
Re: Percolate feature?
I'll look into this. Thanks for the concrete example as I don't even know which classes to start to look at to implement such a feature. On Aug 9, 2013, at 9:49 AM, Roman Chyla wrote: > On Fri, Aug 9, 2013 at 11:29 AM, Mark wrote: > >>> *All* of the terms in the field must be matched by the querynot >> vice-versa. >> >> Exactly. This is why I was trying to explain it as a reverse search. >> >> I just realized I describe it as a *large list of known keywords when >> really its small; no more than 1000. Forgetting about performance how hard >> do you think this would be to implement? How should I even start? >> > > not hard, index all terms into a field - make sure there are no duplicates, > as you want to count them - then I can imagine at least two options: save > the number of terms as a payload together with the terms, or in second step > (in a collector, for example), load the document and count them terms in > the field - if they match the query size, you are done > > a trivial, naive implementation (as you say 'forget performance') could be: > > searcher.search(query, null, new Collector() { > ... > public void collect(int i) throws Exception { > d = reader.document(i, fieldsToLoa); > if (d.getValues(fieldToLoad).size() == query.size()) { >PriorityQueue.add(new ScoreDoc(score, i + docBase)); > } > } > } > > so if your query contains no duplicates and all terms must match, you can > be sure that you are collecting docs only when the number of terms matches > number of clauses in the query > > roman > > >> Thanks for the input >> >> On Aug 9, 2013, at 6:56 AM, Yonik Seeley wrote: >> >>> *All* of the terms in the field must be matched by the querynot >> vice-versa. >>> And no, we don't have a query for that out of the box. To implement, >>> it seems like it would require the total number of terms indexed for a >>> field (for each document). >>> I guess you could also index start and end tokens and then use query >>> expansion to all possible combinations... messy though. >>> >>> -Yonik >>> http://lucidworks.com >>> >>> On Fri, Aug 9, 2013 at 8:19 AM, Erick Erickson >> wrote: >>>> This _looks_ like simple phrase matching (no slop) and highlighting... >>>> >>>> But whenever I think the answer is really simple, it usually means >>>> that I'm missing something >>>> >>>> Best >>>> Erick >>>> >>>> >>>> On Thu, Aug 8, 2013 at 11:18 PM, Mark >> wrote: >>>> >>>>> Ok forget the mention of percolate. >>>>> >>>>> We have a large list of known keywords we would like to match against. >>>>> >>>>> Product keyword: "Sony" >>>>> Product keyword: "Samsung Galaxy" >>>>> >>>>> We would like to be able to detect given a product title whether or >> not it >>>>> matches any known keywords. For a keyword to be matched all of it's >> terms >>>>> must be present in the product title given. >>>>> >>>>> Product Title: "Sony Experia" >>>>> Matches and returns a highlight: "Sony Experia" >>>>> >>>>> Product Title: "Samsung 52inch LC" >>>>> Does not match >>>>> >>>>> Product Title: "Samsung Galaxy S4" >>>>> Matches a returns a highlight: "Samsung Galaxy" >>>>> >>>>> Product Title: "Galaxy Samsung S4" >>>>> Matches a returns a highlight: " Galaxy Samsung" >>>>> >>>>> What would be the best way to approach this? >>>>> >>>>> >>>>> >>>>> >>>>> On Aug 5, 2013, at 7:02 PM, Chris Hostetter >>>>> wrote: >>>>> >>>>>> >>>>>> : Subject: Percolate feature? >>>>>> >>>>>> can you give a more concrete, realistic example of what you are >> trying to >>>>>> do? your synthetic hypothetical example is kind of hard to make sense >> of. 
>>>>>> >>>>>> your Subject line and comment that the "percolate" feature of elastic >>>>>> search sounds like what you want seems to have some lead people down a
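A minimal sketch of the "load the stored terms and count them" idea from the message above, adjusted so that a keyword document is accepted only when every one of its stored terms appears in the product-title query (the subset condition spelled out in the following message). Everything here is assumed rather than taken from the thread: Lucene 4.x Collector API, an open IndexSearcher named searcher, a stored multi-valued field "words" holding each keyword's unique analyzed terms, a titleTermsQuery built as an OR of the title's terms, and a hypothetical analyzeTitle() helper; the usual org.apache.lucene.* and java.util imports are also assumed.

    final Set<String> queryTerms = analyzeTitle(productTitle);      // hypothetical helper
    final List<Integer> matchingKeywordDocs = new ArrayList<Integer>();

    searcher.search(titleTermsQuery, new Collector() {
      private AtomicReader segmentReader;
      private int docBase;

      @Override public void setScorer(Scorer scorer) {}

      @Override public void setNextReader(AtomicReaderContext context) {
        segmentReader = context.reader();
        docBase = context.docBase;
      }

      @Override public void collect(int doc) throws IOException {
        // load only the stored keyword terms for this candidate document
        Document d = segmentReader.document(doc, Collections.singleton("words"));
        String[] stored = d.getValues("words");
        int found = 0;
        for (String term : stored) {
          if (queryTerms.contains(term)) found++;
        }
        if (found == stored.length) {            // every keyword term occurs in the title
          matchingKeywordDocs.add(docBase + doc);
        }
      }

      @Override public boolean acceptsDocsOutOfOrder() { return true; }
    });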
Re: Percolate feature?
> So to reiterate your examples from before, but change the "labels" a > bit and add some more converse examples (and ignore the "highlighting" > aspect for a moment... > > doc1 = "Sony" > doc2 = "Samsung Galaxy" > doc3 = "Sony Playstation" > > queryA = "Sony Experia" ... matches only doc1 > queryB = "Sony Playstation 3" ... matches doc3 and doc1 > queryC = "Samsung 52inch LC" ... doesn't match anything > queryD = "Samsung Galaxy S4" ... matches doc2 > queryE = "Galaxy Samsung S4" ... matches doc2 > > > ...do i still have that correct? Yes > 2) if you *do* care about using non-trivial analysis, then you can't use > the simple "termfreq()" function, which deals with raw terms -- instead > you have to use the "query()" function to ensure that the input is parsed > appropriately -- but then you have to wrap that function in something that > will normalize the scores - so in place of termfreq('words','Galaxy') > you'd want something like... Yes, we will be using non-trivial analysis. Now here's another twist… what if we don't care about scoring? Let's talk about the real use case. We are a marketplace that sells products that users have listed. For certain popular, high-risk or restricted keywords we charge the seller an extra fee/ban the listing. We now have sellers purposely misspelling their listings to circumvent this fee. They will start adding suffixes to their product listings such as "Sonies", knowing that it gets indexed down to "Sony" and thus matches a user's query for Sony. Or they will munge together numbers and products… "2013Sony". The same thing goes for adding crazy non-ASCII characters to the front of the keyword: "ΒSony". This is obviously a problem because we aren't charging for these keywords and, more importantly, it makes our search results look like shit. We would like to: 1) Detect when a certain keyword is in a product title at listing time so we may charge the seller. This was my idea of a "reverse search", although it sounds like I may have caused too much confusion with that term. 2) Attempt to autocorrect these titles, hence the need for highlighting so we can try and replace the terms… this is of course done outside of Solr via an external service. Since we do some stemming (KStemmer) and filtering (WordDelimiterFilterFactory), this makes conventional approaches such as regex quite troublesome. Regex is also quite slow, scales horribly, and always needs to be in lockstep with schema changes. Now knowing this, is there a good way to approach this? Thanks On Aug 9, 2013, at 11:56 AM, Chris Hostetter wrote: > > : I'll look into this. Thanks for the concrete example as I don't even > : know which classes to start to look at to implement such a feature. > > Either roman isn't understanding what you are asking for, or i'm not -- > but i don't think what roman described will work for you... > > : > so if your query contains no duplicates and all terms must match, you can > : > be sure that you are collecting docs only when the number of terms matches > : > number of clauses in the query > > several of the examples you gave did not match what Roman is describing, > as i understand it. Most people on this thread seem to be getting > confused by having their perceptions "flipped" about what your "data known > in advance" is vs the "data you get at request time". > > You described this... > > : > Product keyword: "Sony" > : > Product keyword: "Samsung Galaxy" > : > > : > We would like to be able to detect given a product title whether or > : >> not it > : > matches any known keywords. For a keyword to be matched all of its > : >> terms > : > must be present in the product title given. > : > > : > Product Title: "Sony Experia" > : > Matches and returns a highlight: "Sony Experia" > > ...suggesting that what you call "product keywords" are the "data you know > about in advance" and "product titles" are the data you get at request > time. > > So your example of the "request time" input (ie: query) "Sony Experia" > matching "data known in advance" (ie: indexed document) "Sony" would not > work with Roman's example. > > To rephrase (what i think i understand is) your goal... > > * you have many (10^3+) documents known in advance > * any document D contains a set of words W(D) of varying sizes > * any request Q contains a set of words W(Q) of varying sizes > * you want a given request Q to match a document D if and only if: > - W(D) is a subset of W(Q) > - ie: no item exists in W(D) that does not exist in W(Q) > - ie: any number of items may exist in W(Q) that are not in W(D) > > So to reiterate your examples from before, but change the "labels" a > bit and add some more converse examples (and ignore the "highlighting" > aspect for a moment... > > doc1 = "Sony" > doc2 = "Samsung Galaxy" > doc3 = "Sony Playstation" > > queryA = "Sony Experia" ... matches only doc1 > queryB = "Sony Playsta
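The function-query route described above can be sketched concretely. Assume each keyword is indexed as a document with a multi-valued field "words" (unique terms only, per the earlier precondition) and an integer field "numwords" holding how many terms it has; both field names are invented here. For the trivially analyzed case, a request for the title "Samsung Galaxy S4" could then be a filter that keeps only documents whose entire term set was found in the title:

    fq={!frange l=0 u=0}sub(numwords, sum(termfreq(words,'samsung'),
                                          termfreq(words,'galaxy'),
                                          termfreq(words,'s4')))

The sub(...) expression is 0 exactly when every stored term of the keyword occurs in the title, which is the subset condition W(D) ⊆ W(Q). For non-trivial analysis you would, as the message says, replace each termfreq(...) with a query({!field f=words v='...'}) wrapped so it contributes 0 or 1 rather than a raw score; the exact wrapper depends on the Solr version, so treat this as a sketch rather than a drop-in query.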
Re: Percolate feature?
Our schema is pretty basic.. nothing fancy going on here On Aug 10, 2013, at 3:40 PM, "Jack Krupansky" wrote: > Now we're getting somewhere! > > To (over-simplify), you simply want to know if a given "listing" would match > a high-value pattern, either in a "clean" manner (obvious keywords) or in an > "unclean" manner (e.g., fuzzy keyword matching, stemming, n-grams.) > > To a large this also depends on how rich and powerful your end-user query > support is. So, if the user searches for "sony", "samsung", or "apple", will > it match some oddball listing that fuzzily matches those terms. > > So... tell us, how rich your query interface is. I mean, do you support > wildcard, fuzzy query, ngrams (e.g., can they type "son" or "sam" or "app", > or... will "sony" match "sonblah-blah")? > > Reverse-search may in fact be what you need in this case since you literally > do mean "if I index this document, will it match any of these queries" (but > doesn't score a hit on your direct check for whether it is a clean keyword > match.) > > In your previous examples you only gave clean product titles, not examples of > circumventions of simple keyword matches. > > -- Jack Krupansky > > -Original Message- From: Mark > Sent: Saturday, August 10, 2013 6:24 PM > To: solr-user@lucene.apache.org > Cc: Chris Hostetter > Subject: Re: Percolate feature? > >> So to reiteratve your examples from before, but change the "labels" a >> bit and add some more converse examples (and ignore the "highlighting" >> aspect for a moment... >> >> doc1 = "Sony" >> doc2 = "Samsung Galaxy" >> doc3 = "Sony Playstation" >> >> queryA = "Sony Experia" ... matches only doc1 >> queryB = "Sony Playstation 3" ... matches doc3 and doc1 >> queryC = "Samsung 52inch LC" ... doesn't match anything >> queryD = "Samsung Galaxy S4" ... matches doc2 >> queryE = "Galaxy Samsung S4" ... matches doc2 >> >> >> ...do i still have that correct? > > Yes > >> 2) if you *do* care about using non-trivial analysis, then you can't use >> the simple "termfreq()" function, which deals with raw terms -- in stead >> you have to use the "query()" function to ensure that the input is parsed >> appropriately -- but then you have to wrap that function in something that >> will normalize the scores - so in place of termfreq('words','Galaxy') >> you'd want something like... > > > Yes we will be using non-trivial analysis. Now heres another twist… what if > we don't care about scoring? > > > Let's talk about the real use case. We are marketplace that sells products > that users have listed. For certain popular, high risk or restricted keywords > we charge the seller an extra fee/ban the listing. We now have sellers > purposely misspelling their listings to circumvent this fee. They will start > adding suffixes to their product listings such as "Sonies" knowing that it > gets indexed down to "Sony" and thus matching a users query for Sony. Or they > will munge together numbers and products… "2013Sony". Same thing goes for > adding crazy non-ascii characters to the front of the keyword "Î’Sony". This > is obviously a problem because we aren't charging for these keywords and more > importantly it makes our search results look like shit. > > We would like to: > > 1) Detect when a certain keyword is in a product title at listing time so we > may charge the seller. This was my idea of a "reverse search" although sounds > like I may have caused to much confusion with that term. 
> 2) Attempt to autocorrect these titles hence the need for highlighting so we > can try and replace the terms… this of course done outside of Solr via an > external service. > > Since we do some stemming (KStemmer) and filtering > (WordDelimiterFilterFactory) this makes conventional approaches such as regex > quite troublesome. Regex is also quite slow and scales horribly and always > needs to be in lockstep with schema changes. > > Now knowing this, is there a good way to approach this? > > Thanks > > > On Aug 9, 2013, at 11:56 AM, Chris Hostetter wrote: > >> >> : I'll look into t
Re: Percolate feature?
Any ideas? On Aug 10, 2013, at 6:28 PM, Mark wrote: > Our schema is pretty basic.. nothing fancy going on here > > > > > > protected="protected.txt"/> > generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" > preserveOriginal="1"/> > > > > > > > ignoreCase="true" expand="true"/> > protected="protected.txt"/> > generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" > preserveOriginal="1"/> > > > > > > > On Aug 10, 2013, at 3:40 PM, "Jack Krupansky" wrote: > >> Now we're getting somewhere! >> >> To (over-simplify), you simply want to know if a given "listing" would match >> a high-value pattern, either in a "clean" manner (obvious keywords) or in an >> "unclean" manner (e.g., fuzzy keyword matching, stemming, n-grams.) >> >> To a large this also depends on how rich and powerful your end-user query >> support is. So, if the user searches for "sony", "samsung", or "apple", will >> it match some oddball listing that fuzzily matches those terms. >> >> So... tell us, how rich your query interface is. I mean, do you support >> wildcard, fuzzy query, ngrams (e.g., can they type "son" or "sam" or "app", >> or... will "sony" match "sonblah-blah")? >> >> Reverse-search may in fact be what you need in this case since you literally >> do mean "if I index this document, will it match any of these queries" (but >> doesn't score a hit on your direct check for whether it is a clean keyword >> match.) >> >> In your previous examples you only gave clean product titles, not examples >> of circumventions of simple keyword matches. >> >> -- Jack Krupansky >> >> -Original Message- From: Mark >> Sent: Saturday, August 10, 2013 6:24 PM >> To: solr-user@lucene.apache.org >> Cc: Chris Hostetter >> Subject: Re: Percolate feature? >> >>> So to reiteratve your examples from before, but change the "labels" a >>> bit and add some more converse examples (and ignore the "highlighting" >>> aspect for a moment... >>> >>> doc1 = "Sony" >>> doc2 = "Samsung Galaxy" >>> doc3 = "Sony Playstation" >>> >>> queryA = "Sony Experia" ... matches only doc1 >>> queryB = "Sony Playstation 3" ... matches doc3 and doc1 >>> queryC = "Samsung 52inch LC" ... doesn't match anything >>> queryD = "Samsung Galaxy S4" ... matches doc2 >>> queryE = "Galaxy Samsung S4" ... matches doc2 >>> >>> >>> ...do i still have that correct? >> >> Yes >> >>> 2) if you *do* care about using non-trivial analysis, then you can't use >>> the simple "termfreq()" function, which deals with raw terms -- in stead >>> you have to use the "query()" function to ensure that the input is parsed >>> appropriately -- but then you have to wrap that function in something that >>> will normalize the scores - so in place of termfreq('words','Galaxy') >>> you'd want something like... >> >> >> Yes we will be using non-trivial analysis. Now heres another twist… what if >> we don't care about scoring? >> >> >> Let's talk about the real use case. We are marketplace that sells products >> that users have listed. For certain popular, high risk or restricted >> keywords we charge the seller an extra fee/ban the listing. We now have >> sellers purposely misspelling their listings to circumvent this fee. They >> will start adding suffixes to their product listings such as "Sonies" >> knowing that it gets indexed down to "Sony" and thus matching a users query >> for Sony. Or they will munge together numbers and products… "2013Sony". Same >> thing goes for adding crazy non-ascii characters to the front of the keyword >> "Î’Sony". 
This is obviously a problem because we aren't charging for these >> keywords and more importantly it makes our search results look like shit. >> >> We would like to:
App server?
Is Jetty sufficient for running Solr or should I go with something a little more enterprise like tomcat? Any others?
SolrJ best practices
Are there any links describing best practices for interacting with SolrJ? I've checked the wiki and it seems woefully incomplete: (http://wiki.apache.org/solr/Solrj) Some specific questions: - When working with HttpSolrServer should we keep around instances for ever or should we create a singleton that can/should be used over and over? - Is there a way to change the collection after creating the server or do we need to create a new server for each collection? -..
Bootstrapping / Full Importing using Solr Cloud
We are in the process of upgrading our Solr cluster to the latest and greatest Solr Cloud. I have some questions regarding full indexing though. We're currently running a long job (~30 hours) using DIH to do a full index on over 10M products. This process consumes a lot of memory, and while updating it cannot handle any user requests. What would be the best way to go about this when using Solr Cloud? First off, does DIH work with Solr Cloud? Would I need to separate out my DIH indexing machine from the machines serving up user requests? If not going down the DIH route, what are my best options (SolrJ?) Thanks for the input
Re: SolrJ best practices
Thanks for the clarification. In Solr Cloud just use 1 connection. In non-cloud environments you will need one per core. On Oct 8, 2013, at 5:58 PM, Shawn Heisey wrote: > On 10/7/2013 3:08 PM, Mark wrote: >> Some specific questions: >> - When working with HttpSolrServer should we keep around instances for ever >> or should we create a singleton that can/should be used over and over? >> - Is there a way to change the collection after creating the server or do we >> need to create a new server for each collection? > > If at all possible, you should create your server object and use it for the > life of your application. SolrJ is threadsafe. If there is any part of it > that's not, the javadocs should say so - the SolrServer implementations > definitely are. > > By using the word "collection" you are implying that you are using SolrCloud > ... but earlier you said HttpSolrServer, which implies that you are NOT using > SolrCloud. > > With HttpSolrServer, your base URL includes the core or collection name - > "http://server:port/solr/corename"; for example. Generally you will need one > object for each core/collection, and another object for server-level things > like CoreAdmin. > > With SolrCloud, you should be using CloudSolrServer instead, another > implementation of SolrServer that is constantly aware of the SolrCloud > clusterstate. With that object, you can use setDefaultCollection, and you > can also add a "collection" parameter to each SolrQuery or other request > object. > > Thanks, > Shawn >
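A short SolrJ sketch of the two setups Shawn describes; URLs, core and collection names are placeholders, SolrJ 4.x is assumed, and exception handling is omitted for brevity:

    // Non-cloud: one long-lived, thread-safe HttpSolrServer per core
    HttpSolrServer products = new HttpSolrServer("http://localhost:8983/solr/products");
    QueryResponse rsp = products.query(new SolrQuery("sony"));

    // SolrCloud: a single CloudSolrServer for the whole cluster
    CloudSolrServer cloud = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    cloud.setDefaultCollection("products");
    QueryResponse cloudRsp = cloud.query(new SolrQuery("sony"));

Both objects are meant to be created once and reused for the life of the application.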
Setting SolrCloudServer collection
If using one static SolrCloudServer, how can I add a bean to a certain collection? Do I need to update setDefaultCollection() each time? I doubt that's thread safe. Thanks
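One hedged possibility, assuming a shared CloudSolrServer instance named cloud on SolrJ 4.x (whether the per-request collection parameter is honored depends on the exact version, so verify before relying on it); the bean and collection names are invented:

    UpdateRequest req = new UpdateRequest();
    req.add(cloud.getBinder().toSolrInputDocument(productBean));
    req.setParam("collection", "products");   // route this one request, leave the default alone
    req.process(cloud);

This avoids flipping setDefaultCollection() back and forth, which indeed would not be safe with concurrent threads.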
Re: DIH importing
Thanks, I'll give that a try. On 8/26/11 9:54 AM, simon wrote: It sounds as though you are optimizing the index after the delta import. If you don't do that, then only new segments will be replicated and syncing will be much faster. On Fri, Aug 26, 2011 at 12:08 PM, Mark wrote: We are currently delta-importing using DIH, after which all of our servers have to download the full index (16G). This obviously puts quite a strain on our slaves while they are syncing over the index. Is there any way not to sync over the whole index, but rather just the parts that have changed? We would like to get to the point where we are no longer using DIH but rather are constantly sending documents over HTTP to our master in real time. We would then like our slaves to download these changes as soon as possible. Is something like this even possible? Thanks for your help
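For reference, DIH accepts an optimize flag on the command itself, so a delta run can commit without optimizing (host and handler path here are placeholders):

    http://master:8983/solr/dataimport?command=delta-import&commit=true&optimize=false

With optimization skipped, only the newly written segments should need to be pulled by the slaves on the next replication poll.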
Searching multiple fields
I have a use case where I would like to search across two fields but I do not want to weight a document that has a match in both fields higher than a document that has a match in only 1 field. For example. Document 1 - Field A: "Foo Bar" - Field B: "Foo Baz" Document 2 - Field A: "Foo Blarg" - Field B: "Something else" Now when I search for "Foo" I would like document 1 and 2 to be similarly scored however document 1 will be scored much higher in this use case because it matches in both fields. I could create a third field and use copyField directive to search across that but I was wondering if there is an alternative way. It would be nice if we could search across some sort of "virtual field" that will use both underlying fields but not actually increase the size of the index. Thanks
Re: Searching multiple fields
I thought that a similarity class will only affect the scoring of a single field.. not across multiple fields? Can anyone else chime in with some input? Thanks. On 9/26/11 9:02 PM, Otis Gospodnetic wrote: Hi Mark, Eh, I don't have Lucene/Solr source code handy, but I *think* for that you'd need to write custom Lucene similarity. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ ____ From: Mark To: solr-user@lucene.apache.org Sent: Monday, September 26, 2011 8:12 PM Subject: Searching multiple fields I have a use case where I would like to search across two fields but I do not want to weight a document that has a match in both fields higher than a document that has a match in only 1 field. For example. Document 1 - Field A: "Foo Bar" - Field B: "Foo Baz" Document 2 - Field A: "Foo Blarg" - Field B: "Something else" Now when I search for "Foo" I would like document 1 and 2 to be similarly scored however document 1 will be scored much higher in this use case because it matches in both fields. I could create a third field and use copyField directive to search across that but I was wondering if there is an alternative way. It would be nice if we could search across some sort of "virtual field" that will use both underlying fields but not actually increase the size of the index. Thanks
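One option not raised in the thread, so treat it as a hedged aside rather than the recommended answer: the dismax query parser scores each document by the maximum of its per-field scores when the tie parameter is 0, so a match in both fields is not rewarded over a match in one. Something like:

    q=Foo&defType=dismax&qf=fieldA+fieldB&tie=0.0

With tie=0.0 the two example documents should score comparably; raising tie back toward 1.0 moves the behavior back toward summing the per-field scores.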
HBase Datasource
Has anyone had any success/experience with building a HBase datasource for DIH? Are there any solutions available on the web? Thanks.
CachedSqlEntityProcessor
I am trying to use the CachedSqlEntityProcessor with Solr 1.4.2 however I am not seeing any performance gains. I've read some other posts that reference cacheKey and cacheLookup however I don't see any reference to them in the wiki http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor Can someone please clarify? Thanks
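For reference, the two configuration styles that show up in DIH examples look roughly like this (table and column names are invented; the where= form is the one the 1.4-era wiki documents, while cacheKey/cacheLookup belong to the reworked DIH caching in later releases, which may be why they look undocumented from a 1.4.2 standpoint):

    <entity name="item" query="select id, name from item">
      <!-- older style -->
      <entity name="feature" processor="CachedSqlEntityProcessor"
              query="select item_id, description from feature"
              where="item_id=item.id"/>

      <!-- newer style -->
      <entity name="feature" processor="CachedSqlEntityProcessor"
              query="select item_id, description from feature"
              cacheKey="item_id" cacheLookup="item.id"/>
    </entity>

Only one of the two nested entities would actually be used; they are shown side by side for comparison.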
Re: CachedSqlEntityProcessor
FYI my sub-entity looks like the following On 11/15/11 10:42 AM, Mark wrote: I am trying to use the CachedSqlEntityProcessor with Solr 1.4.2 however I am not seeing any performance gains. I've read some other posts that reference cacheKey and cacheLookup however I don't see any reference to them in the wiki http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor Can someone please clarify? Thanks
Multithreaded DIH bug
I'm trying to use multiple threads with DIH but I keep receiving the following error.. "Operation not allowed after ResultSet closed" Is there anyway I can fix this? Dec 1, 2011 4:38:47 PM org.apache.solr.common.SolrException log SEVERE: Full Import failed:java.lang.RuntimeException: Error in multi-threaded import at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408) Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.sql.SQLException: Operation not allowed after ResultSet closed at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64) at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:339) at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$600(JdbcDataSource.java:228) at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:262) at org.apache.solr.handler.dataimport.CachedSqlEntityProcessor.getAllNonCachedRows(CachedSqlEntityProcessor.java:72) at org.apache.solr.handler.dataimport.EntityProcessorBase.getIdCacheData(EntityProcessorBase.java:201) at org.apache.solr.handler.dataimport.CachedSqlEntityProcessor.nextRow(CachedSqlEntityProcessor.java:60) at org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper.nextRow(ThreadedEntityProcessorWrapper.java:84) at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:449) at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.run(DocBuilder.java:402) at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:469) at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.access$000(DocBuilder.java:356) at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner$1.run(DocBuilder.java:409) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) Caused by: java.sql.SQLException: Operation not allowed after ResultSet closed at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927) at com.mysql.jdbc.ResultSetImpl.checkClosed(ResultSetImpl.java:794) at com.mysql.jdbc.ResultSetImpl.next(ResultSetImpl.java:7139) at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:331) ... 14 more
Re: Multithreaded DIH bug
Thanks for the info On 12/2/11 1:29 AM, Mikhail Khludnev wrote: Hello, AFAIK Particularly this exception is not a big deal. It's just one of the evidence of the fact that CachedSqlEntityProcessor doesn't work in multiple threads at 3.x and 4.0. It's discussed at http://search-lucene.com/m/0DNn32L2UBv the most problem here is the following messages in the log org.apache.solr.handler.dataimport.*ThreadedEntityProcessorWrapper. nextRow* *arow : null* Some time ago I did the patch for 3.4 (pretty raw) you can try it http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201110.mbox/browser I plan (but only plan, sorry) to address it at 4.0 where SOLR-2382 refactoring has been applied recently. Regards On Fri, Dec 2, 2011 at 4:57 AM, Mark wrote: I'm trying to use multiple threads with DIH but I keep receiving the following error.. "Operation not allowed after ResultSet closed" Is there anyway I can fix this? Dec 1, 2011 4:38:47 PM org.apache.solr.common.**SolrException log SEVERE: Full Import failed:java.lang.**RuntimeException: Error in multi-threaded import at org.apache.solr.handler.**dataimport.DocBuilder.** doFullDump(DocBuilder.java:**268) at org.apache.solr.handler.**dataimport.DocBuilder.execute(** DocBuilder.java:187) at org.apache.solr.handler.**dataimport.DataImporter.** doFullImport(DataImporter.**java:359) at org.apache.solr.handler.**dataimport.DataImporter.** runCmd(DataImporter.java:427) at org.apache.solr.handler.**dataimport.DataImporter$1.run(** DataImporter.java:408) Caused by: org.apache.solr.handler.**dataimport.**DataImportHandlerException: java.sql.SQLException: Operation not allowed after ResultSet closed at org.apache.solr.handler.**dataimport.**DataImportHandlerException.** wrapAndThrow(**DataImportHandlerException.**java:64) at org.apache.solr.handler.**dataimport.JdbcDataSource$** ResultSetIterator.hasnext(**JdbcDataSource.java:339) at org.apache.solr.handler.**dataimport.JdbcDataSource$** ResultSetIterator.access$600(**JdbcDataSource.java:228) at org.apache.solr.handler.**dataimport.JdbcDataSource$** ResultSetIterator$1.hasNext(**JdbcDataSource.java:262) at org.apache.solr.handler.**dataimport.**CachedSqlEntityProcessor.** getAllNonCachedRows(**CachedSqlEntityProcessor.java:**72) at org.apache.solr.handler.**dataimport.**EntityProcessorBase.** getIdCacheData(**EntityProcessorBase.java:201) at org.apache.solr.handler.**dataimport.**CachedSqlEntityProcessor.** nextRow(**CachedSqlEntityProcessor.java:**60) at org.apache.solr.handler.**dataimport.** ThreadedEntityProcessorWrapper**.nextRow(**ThreadedEntityProcessorWrapper* *.java:84) at org.apache.solr.handler.**dataimport.DocBuilder$** EntityRunner.runAThread(**DocBuilder.java:449) at org.apache.solr.handler.**dataimport.DocBuilder$** EntityRunner.run(DocBuilder.**java:402) at org.apache.solr.handler.**dataimport.DocBuilder$** EntityRunner.runAThread(**DocBuilder.java:469) at org.apache.solr.handler.**dataimport.DocBuilder$** EntityRunner.access$000(**DocBuilder.java:356) at org.apache.solr.handler.**dataimport.DocBuilder$** EntityRunner$1.run(DocBuilder.**java:409) at java.util.concurrent.**ThreadPoolExecutor.runWorker(** ThreadPoolExecutor.java:1110) at java.util.concurrent.**ThreadPoolExecutor$Worker.run(** ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.**java:636) Caused by: java.sql.SQLException: Operation not allowed after ResultSet closed at com.mysql.jdbc.SQLError.**createSQLException(SQLError.**java:1073) at com.mysql.jdbc.SQLError.**createSQLException(SQLError.**java:987) at 
com.mysql.jdbc.SQLError.**createSQLException(SQLError.**java:982) at com.mysql.jdbc.SQLError.**createSQLException(SQLError.**java:927) at com.mysql.jdbc.ResultSetImpl.**checkClosed(ResultSetImpl.**java:794) at com.mysql.jdbc.ResultSetImpl.**next(ResultSetImpl.java:7139) at org.apache.solr.handler.**dataimport.JdbcDataSource$** ResultSetIterator.hasnext(**JdbcDataSource.java:331) ... 14 more
Question on DIH delta imports
*pk*: The primary key for the entity. It is *optional* and only needed when using delta-imports. It has no relation to the uniqueKey defined in schema.xml, but they both can be the same. When used in a nested entity, is the pk the primary key column of the nested entity's own table or the key used for joining? For example: say I have a table foo whose primary key is id, and a table sub_foo whose primary key is also id. Table sub_foo also has a column named foo_id that is used for joining foo and sub_foo. Should the nested entity's pk be "id" or "foo_id"? Thanks for your help
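To make the two alternatives concrete, a hedged reconstruction (query bodies and attribute lists are illustrative only, and this does not answer which form DIH expects; only one of the nested entities would be used):

    <entity name="foo" pk="id" query="select * from foo">
      <!-- option 1: pk is sub_foo's own primary key -->
      <entity name="sub_foo" pk="id"
              query="select * from sub_foo where foo_id='${foo.id}'"/>

      <!-- option 2: pk is the joining column -->
      <entity name="sub_foo" pk="foo_id"
              query="select * from sub_foo where foo_id='${foo.id}'"/>
    </entity>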
Re: Question on DIH delta imports
Anyone? On 12/5/11 11:04 AM, Mark wrote: *pk*: The primary key for the entity. It is*optional*and only needed when using delta-imports. It has no relation to the uniqueKey defined in schema.xml but they both can be the same. When using in a nested entity is the PK the primary key column of the join table or the key used for joining? For example Say I have table /foo/ whose primary id is id and table /sub_foo/ whose primary key is also id. Table sub_foo also has a column named/foo_id/ that is used for joining /foo/ and /sub_foo/. Should it be: Or: Thanks for your help
Design questions/Schema Help
We are thinking about using Cassandra to store our search logs. Can someone point me in the right direction/lend some guidance on design? I am new to Cassandra and I am having trouble wrapping my head around some of these new concepts. My brain keeps wanting to go back to a RDBMS design. We will be storing the user query, # of hits returned and their session id. We would like to be able to answer the following questions. - What is the n most popular queries and their counts within the last x (mins/hours/days/etc). Basically the most popular searches within a given time range. - What is the most popular query within the last x where hits = 0. Same as above but with an extra "where" clause - For session id x give me all their other queries - What are all the session ids that searched for 'foos' We accomplish the above functionality w/ MySQL using 2 tables. One for the raw search log information and the other to keep the aggregate/running counts of queries. Would this sort of ad-hoc querying be better implemented using Hadoop + Hive? If so, should I be storing all this information in Cassandra then using Hadoop to retrieve it? Thanks for your suggestions
Re: Design questions/Schema Help
On 7/26/10 4:43 PM, Mark wrote: We are thinking about using Cassandra to store our search logs. Can someone point me in the right direction/lend some guidance on design? I am new to Cassandra and I am having trouble wrapping my head around some of these new concepts. My brain keeps wanting to go back to a RDBMS design. We will be storing the user query, # of hits returned and their session id. We would like to be able to answer the following questions. - What is the n most popular queries and their counts within the last x (mins/hours/days/etc). Basically the most popular searches within a given time range. - What is the most popular query within the last x where hits = 0. Same as above but with an extra "where" clause - For session id x give me all their other queries - What are all the session ids that searched for 'foos' We accomplish the above functionality w/ MySQL using 2 tables. One for the raw search log information and the other to keep the aggregate/running counts of queries. Would this sort of ad-hoc querying be better implemented using Hadoop + Hive? If so, should I be storing all this information in Cassandra then using Hadoop to retrieve it? Thanks for your suggestions Whoops wrong forum
Solr crawls during replication
We have an index around 25-30G w/ 1 master and 5 slaves. We perform replication every 30 mins. During replication the disk I/O obviously shoots up on the slaves to the point where all requests routed to that slave take a really long time... sometimes to the point of timing out. Are there any logical or physical changes we could make to our architecture to overcome this problem? Thanks
DIH and Cassandra
Is it possible to use DIH with Cassandra either out of the box or with something more custom? Thanks
Throttling replication
Is there any way or forthcoming patch that would allow configuration of how much network bandwidth (and ultimately disk I/O) a slave is allowed during replication? We currently have the problem that while replicating, our disk I/O goes through the roof. I would much rather have the replication take 2x as long with half the disk I/O. Any thoughts? Thanks
Re: Solr crawls during replication
On 8/6/10 5:03 PM, Chris Hostetter wrote: : We have an index around 25-30G w/ 1 master and 5 slaves. We perform : replication every 30 mins. During replication the disk I/O obviously shoots up : on the slaves to the point where all requests routed to that slave take a : really long time... sometimes to the point of timing out. : : Are there any logical or physical changes we could make to our architecture to : overcome this problem? If the problem really is disk I/O then perhaps you don't have enough RAM set aside for the filesystem cache to keep the "current" index in memory? I've seen people have this type of problem before, but usually it's network I/O that is the bottleneck, in which case using multiple NICs on your slaves (one for client requests, one for replication) can help. I think at one point there was also talk about leveraging an rsync option to force snappuller to throttle itself and only use a max amount of bandwidth -- but then we moved away from script based replication to java based replication and i don't think the Java Network/IO system supports that type of throttling. However: you might be able to configure it in your switches/routers (ie: only let the slaves use X% of their total bandwidth to talk to the master) -Hoss Thanks for the suggestions. Our slaves have 12G with 10G dedicated to the JVM... too much? Are the rsync snappuller features still available in 1.4.1? I may try that to see if it helps. Configuration of the switches may also be possible. Also, would you mind explaining your second point... using dual NIC cards. How can this be accomplished/configured? Thanks for your help
Re: Throttling replication
On 9/2/10 8:27 AM, Noble Paul നോബിള് नोब्ळ् wrote: There is no way to currently throttle replication. It consumes the whole bandwidth available. It is a nice-to-have feature On Thu, Sep 2, 2010 at 8:11 PM, Mark wrote: Is there any way or forthcoming patch that would allow configuration of how much network bandwidth (and ultimately disk I/O) a slave is allowed during replication? We currently have the problem that while replicating, our disk I/O goes through the roof. I would much rather have the replication take 2x as long with half the disk I/O. Any thoughts? Thanks Do you mean that consuming the whole bandwidth available is a nice-to-have feature, or the ability to throttle the bandwidth?
Re: Throttling replication
On 9/2/10 10:21 AM, Brandon Evans wrote: Are you using rsync replication or the built-in replication available in Solr 1.4? I have a patch that easily allows the --bwlimit option to be added to the rsyncd command line. Either way I agree that a way to throttle the replication bandwidth would be nice. -brandon On 9/2/10 7:41 AM, Mark wrote: Is there any way or forthcoming patch that would allow configuration of how much network bandwidth (and ultimately disk I/O) a slave is allowed during replication? We currently have the problem that while replicating, our disk I/O goes through the roof. I would much rather have the replication take 2x as long with half the disk I/O. Any thoughts? Thanks I am using the built-in replication. Can you send me a link to the patch so I can give it a try? Thanks
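For anyone unfamiliar with the option, --bwlimit caps rsync's transfer rate in KB/s, so a script-based snappull could be throttled roughly like this (paths and the limit are placeholders):

    rsync -av --bwlimit=8192 master:/data/solr/snapshot.20100902120000/ /data/solr/index/

8192 KB/s works out to roughly 8 MB/s on the wire.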
Re: Solr crawls during replication
On 9/3/10 11:37 AM, Jonathan Rochkind wrote: Is the OS disk cache something you configure, or something the OS just does automatically based on available free RAM? Or does it depend on the exact OS? Thinking about the OS disk cache is new to me. Thanks for any tips. From: Shawn Heisey [s...@elyograg.org] Sent: Friday, September 03, 2010 1:46 PM To: solr-user@lucene.apache.org Subject: Re: Solr crawls during replication On 9/2/2010 9:31 AM, Mark wrote: Thanks for the suggestions. Our slaves have 12G with 10G dedicated to the JVM.. too much? Are the rysnc snappuller featurs still available in 1.4.1? I may try that to see if helps. Configuration of the switches may also be possible. Also, would you mind explaining your second point... using dual NIC cards. How can this be accomplished/configured. Thanks for you help I will first admit that I am a relative newbie at this whole thing, so find yourself a grain of salt before you read further ... While it's probably not a bad idea to change to an rsync method and implement bandwidth throttling, I'm betting the real root of your issue is that you're low on memory, making your disk cache too small. When you do a replication, the simple act of copying the data shoves the current index completely out of RAM, so when you do a query, it has to go back to the disk (which is now VERY busy) to satisfy it. Unless you know for sure that you need 10GB dedicated to the JVM, go with much smaller values, because out of the 12GB available, that will only leave you about 1.5GB, assuming the machine has no GUI and no other processes. If you need the JVM that large because you have very large Solr caches, consider reducing their size dramatically. In deciding whether to use precious memory for the OS disk cache or Solr caches, the OS should go first. Additionally, If you have large Solr caches with a small disk cache and configure large autowarm counts, you end up with extremely long commit times. I don't know how the 30GB of data in your index is distributed among the various Lucene files, but for an index that size, I'd want to have between 8GB and 16GB of RAM available to the OS just for disk caching, and if more is possible, even better. If you could get more than 32GB of RAM in the server, your entire index would fit, and it would be very fast. With a little research, I came up (on my own) with what I think is a decent rule of thumb, and I'm curious what the experts think of this idea: Find out how much space is taken by the index files with the following extensions: fnm, fdx, frq, nrm, tii, tis, and tvx. Think of that as a bare minimum disk cache size, then shoot for between 1.5 and 3 times that value for your disk cache, so it can also cache parts of the other files. Thanks, Shawn Ditto on that question
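One hedged way to apply that rule of thumb from a shell, assuming a typical index directory layout (adjust the path to your data directory):

    find /var/solr/data/index -type f \
      \( -name '*.fnm' -o -name '*.fdx' -o -name '*.frq' -o -name '*.nrm' \
         -o -name '*.tii' -o -name '*.tis' -o -name '*.tvx' \) \
      -exec du -ch {} + | tail -n 1

The final "total" line is the combined size of those files, i.e. the suggested floor for the OS disk cache; multiply by 1.5 to 3 for the target.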
Solr Cloud Architecture and DIH
We're currently running Solr 3.5 and our indexing process works as follows: We have a master that has a cron job to run a delta import via DIH every 5 minutes. The delta-import takes around 75 minutes to fully complete; most of that is due to optimization after each delta, and then the slaves sync up. Our index is around 30 gigs, so after delta-importing it takes a few minutes to sync to each slave, which causes a huge increase in disk I/O and thus slows the machine down to an unusable state. To get around this we have a rolling upgrade process whereby one slave at a time takes itself offline, syncs, and then brings itself back up. Gross… I know. When we want to run a full-import, which could take upwards of 30 hours, we run it on a separate Solr master while the first Solr master continues to delta-import. When the staging Solr master is finally done importing, we copy over the index to the main Solr master, which will then sync up with the slaves. This has been working for us but it obviously has its flaws. I've been looking into completely re-writing our architecture to utilize Solr Cloud to help us with some of these pain points, if it makes sense. Please let me know how Solr 4.0 and Solr Cloud could help. I also have the following questions. Does DIH work with Solr Cloud? Can Solr Cloud utilize the whole cluster to index in parallel, to remove the burden of that task from one machine? If so, how is it balanced across all nodes? Can this work with DIH? When we decide to run a full-import, how can we do this and not affect our existing cluster, since there is no real master/slave and obviously no staging "master"? Thanks in advance! - M
Re: Need help with graphing function (MATH)
Thanks I'll have a look at this. I should have mentioned that the actual values on the graph aren't important rather I was showing an example of how the function should behave. On 2/13/12 6:25 PM, Kent Fitch wrote: Hi, assuming you have x and want to generate y, then maybe - if x < 50, y = 150 - if x > 175, y = 60 - otherwise : either y = (100/(e^((x -50)/75)^2)) + 50 http://www.wolframalpha.com/input/?i=plot++%28100%2F%28e^%28%28x+-50%29%2F75%29^2%29%29+%2B+50%2C+x%3D50..175 - or maybe y =sin((x+5)/38)*42+105 http://www.wolframalpha.com/input/?i=plot++sin%28%28x%2B5%29%2F38%29*42%2B105%2C+x%3D50..175 Regards, Kent Fitch On Tue, Feb 14, 2012 at 12:29 PM, Mark <mailto:static.void@gmail.com>> wrote: I need some help with one of my boost functions. I would like the function to look something like the following mockup below. Starts off flat then there is a gradual decline, steep decline then gradual decline and then back to flat. Can some of you math guys please help :) Thanks.
Re: Need help with graphing function (MATH)
Would you mind throwing out an example of these types of functions. Looking at Wikipedia (http://en.wikipedia.org/wiki/Probit) its seems like the Probit function is very similar to what I want. Thanks On 2/14/12 10:56 AM, Ted Dunning wrote: In general this kind of function is very easy to construct using sums of basic sigmoidal functions. The logistic and probit functions are commonly used for this. Sent from my iPhone On Feb 14, 2012, at 10:05, Mark wrote: Thanks I'll have a look at this. I should have mentioned that the actual values on the graph aren't important rather I was showing an example of how the function should behave. On 2/13/12 6:25 PM, Kent Fitch wrote: Hi, assuming you have x and want to generate y, then maybe - if x< 50, y = 150 - if x> 175, y = 60 - otherwise : either y = (100/(e^((x -50)/75)^2)) + 50 http://www.wolframalpha.com/input/?i=plot++%28100%2F%28e^%28%28x+-50%29%2F75%29^2%29%29+%2B+50%2C+x%3D50..175 - or maybe y =sin((x+5)/38)*42+105 http://www.wolframalpha.com/input/?i=plot++sin%28%28x%2B5%29%2F38%29*42%2B105%2C+x%3D50..175 Regards, Kent Fitch On Tue, Feb 14, 2012 at 12:29 PM, Markmailto:static.void@gmail.com>> wrote: I need some help with one of my boost functions. I would like the function to look something like the following mockup below. Starts off flat then there is a gradual decline, steep decline then gradual decline and then back to flat. Can some of you math guys please help :) Thanks.
Re: Need help with graphing function (MATH)
Or better yet an example in solr would be best :) Thanks! On 2/14/12 11:05 AM, Mark wrote: Would you mind throwing out an example of these types of functions. Looking at Wikipedia (http://en.wikipedia.org/wiki/Probit) its seems like the Probit function is very similar to what I want. Thanks On 2/14/12 10:56 AM, Ted Dunning wrote: In general this kind of function is very easy to construct using sums of basic sigmoidal functions. The logistic and probit functions are commonly used for this. Sent from my iPhone On Feb 14, 2012, at 10:05, Mark wrote: Thanks I'll have a look at this. I should have mentioned that the actual values on the graph aren't important rather I was showing an example of how the function should behave. On 2/13/12 6:25 PM, Kent Fitch wrote: Hi, assuming you have x and want to generate y, then maybe - if x< 50, y = 150 - if x> 175, y = 60 - otherwise : either y = (100/(e^((x -50)/75)^2)) + 50 http://www.wolframalpha.com/input/?i=plot++%28100%2F%28e^%28%28x+-50%29%2F75%29^2%29%29+%2B+50%2C+x%3D50..175 - or maybe y =sin((x+5)/38)*42+105 http://www.wolframalpha.com/input/?i=plot++sin%28%28x%2B5%29%2F38%29*42%2B105%2C+x%3D50..175 Regards, Kent Fitch On Tue, Feb 14, 2012 at 12:29 PM, Markmailto:static.void@gmail.com>> wrote: I need some help with one of my boost functions. I would like the function to look something like the following mockup below. Starts off flat then there is a gradual decline, steep decline then gradual decline and then back to flat. Can some of you math guys please help :) Thanks.
Question on replication
After I perform a delta-import on my master the slave replicates the whole index which can be quite time consuming. Is there any way for the slave to replicate only partials that have changed? Do I need to change some setting on master not to commit/optimize to get this to work? Thanks
Solr DataImportHandler (DIH) and Cassandra
Is there anyway to use DIH to import from Cassandra? Thanks
Re: Solr DataImportHandler (DIH) and Cassandra
The DataSource subclass route is what I will probably be interested in. Are there are working examples of this already out there? On 11/29/10 12:32 PM, Aaron Morton wrote: AFAIK there is nothing pre-written to pull the data out for you. You should be able to create your DataSource sub class http://lucene.apache.org/solr/api/org/apache/solr/handler/dataimport/DataSource.html Using the Hector java library to pull data from Cassandra. I'm guessing you will need to consider how to perform delta imports. Perhaps using the secondary indexes in 0.7* , or maintaining your own queues or indexes to know what has changed. There is also the Lucandra project, not exactly what your after but may be of interest anyway https://github.com/tjake/Lucandra Hope that helps. Aaron On 30 Nov, 2010,at 05:04 AM, Mark wrote: Is there anyway to use DIH to import from Cassandra? Thanks
Limit number of characters returned
Is there way to limit the number of characters returned from a stored field? For example: Say I have a document (~2K words) and I search for a word that's somewhere in the middle. I would like the document to match the search query but the stored field should only return the first 200 characters of the document. Is there anyway to accomplish this that doesn't involve two fields? Thanks
Re: Limit number of characters returned
Correct me if I am wrong but I would like to return highlighted excerpts from the document so I would still need to index and store the whole document right (ie.. highlighting only works on stored fields)? On 12/3/10 3:51 AM, Ahmet Arslan wrote: --- On Fri, 12/3/10, Mark wrote: From: Mark Subject: Limit number of characters returned To: solr-user@lucene.apache.org Date: Friday, December 3, 2010, 5:39 AM Is there way to limit the number of characters returned from a stored field? For example: Say I have a document (~2K words) and I search for a word that's somewhere in the middle. I would like the document to match the search query but the stored field should only return the first 200 characters of the document. Is there anyway to accomplish this that doesn't involve two fields? I don't think it is possible out-of-the-box. May be you can hack highlighter to return that first 200 characters in highlighting response. Or a custom response writer can do that. But if you will be always returning first 200 characters of documents, I think creating additional field with indexed="false" stored="true" will be more efficient. And you can make your original field indexed="true" stored="false", your index size will be diminished.
Negative fl param
When returning results is there a way I can say to return all fields except a certain one? So say I have stored fields foo, bar and baz but I only want to return foo and bar. Is it possible to do this without specifically listing out the fields I do want?
Re: Limit number of characters returned
Thanks for the response. Couldn't I just use the highlighter and configure it to use the alternative field to return the first 200 characters? In cases where there is a highlighter match I would prefer to show the excerpts anyway. http://wiki.apache.org/solr/HighlightingParameters#hl.alternateField http://wiki.apache.org/solr/HighlightingParameters#hl.maxAlternateFieldLength Is this something wrong with this method? On 12/3/10 8:03 AM, Erick Erickson wrote: Yep, you're correct. CopyField is probably your simplest option here as Ahmet suggested. A more complex solution would be your own response writer, but unless and until you index gets cumbersome, I'd avoid that. Plus, storing the copied contents only shouldn't impact search much, since this doesn't add any terms... Best Erick On Fri, Dec 3, 2010 at 10:32 AM, Mark wrote: Correct me if I am wrong but I would like to return highlighted excerpts from the document so I would still need to index and store the whole document right (ie.. highlighting only works on stored fields)? On 12/3/10 3:51 AM, Ahmet Arslan wrote: --- On Fri, 12/3/10, Mark wrote: From: Mark Subject: Limit number of characters returned To: solr-user@lucene.apache.org Date: Friday, December 3, 2010, 5:39 AM Is there way to limit the number of characters returned from a stored field? For example: Say I have a document (~2K words) and I search for a word that's somewhere in the middle. I would like the document to match the search query but the stored field should only return the first 200 characters of the document. Is there anyway to accomplish this that doesn't involve two fields? I don't think it is possible out-of-the-box. May be you can hack highlighter to return that first 200 characters in highlighting response. Or a custom response writer can do that. But if you will be always returning first 200 characters of documents, I think creating additional field with indexed="false" stored="true" will be more efficient. And you can make your original field indexed="true" stored="false", your index size will be diminished.
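For reference, the parameters being discussed would look something like this on a request (the field name body is assumed):

    hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=200
    &hl.alternateField=body&hl.maxAlternateFieldLength=200

When the highlighter finds a match it returns normal snippets; when it finds nothing, it falls back to the first 200 characters of the alternate field.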
Re: Negative fl param
Ok simple enough. I just created a SearchComponent that removes values from the fl param. On 12/3/10 9:32 AM, Ahmet Arslan wrote: When returning results is there a way I can say to return all fields except a certain one? So say I have stored fields foo, bar and baz but I only want to return foo and bar. Is it possible to do this without specifically listing out the fields I do want? There were a similar discussion. http://search-lucene.com/m/2qJaU1wImo3/ A workaround can be getting all stored field names from http://wiki.apache.org/solr/LukeRequestHandler and construct fl accordingly.
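A sketch of what such a component might look like; this is not the poster's code, the parameter name fl.exclude is invented, and it targets the 1.4-era SearchComponent API:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import org.apache.solr.common.params.CommonParams;
    import org.apache.solr.common.params.ModifiableSolrParams;
    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.common.util.StrUtils;
    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;
    import org.apache.solr.schema.SchemaField;

    public class ExcludeFieldsComponent extends SearchComponent {

      @Override
      public void prepare(ResponseBuilder rb) throws IOException {
        SolrParams params = rb.req.getParams();
        String exclude = params.get("fl.exclude");          // invented parameter name
        if (exclude == null) return;

        Set<String> excluded = new HashSet<String>(Arrays.asList(exclude.split(",")));
        List<String> keep = new ArrayList<String>();
        for (SchemaField f : rb.req.getSchema().getFields().values()) {
          if (f.stored() && !excluded.contains(f.getName())) {
            keep.add(f.getName());
          }
        }
        // rewrite fl so the excluded stored fields are never returned
        ModifiableSolrParams rewritten = new ModifiableSolrParams(params);
        rewritten.set(CommonParams.FL, StrUtils.join(keep, ','));
        rb.req.setParams(rewritten);
      }

      @Override
      public void process(ResponseBuilder rb) throws IOException {
        // nothing to do at process time; prepare() already rewrote fl
      }

      @Override
      public String getDescription() { return "Drops fields named in fl.exclude from fl"; }

      @Override
      public String getSource() { return "n/a"; }

      @Override
      public String getSourceId() { return "n/a"; }

      @Override
      public String getVersion() { return "1.0"; }
    }

It would be registered as a searchComponent in solrconfig.xml and listed in the handler's first-components so its prepare() runs before the query component reads fl.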
Highlighting parameters
Is there a way I can specify separate highlighting configuration for 2 different fields? For field 1 I want to display only 100 characters; for field 2, 200 characters.
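Per-field overrides of the highlighting parameters should cover this; field names here are placeholders:

    hl=true&hl.fl=title,description&f.title.hl.fragsize=100&f.description.hl.fragsize=200

hl.fragsize is the snippet size in characters, and the f.<field>. prefix scopes it to a single field.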
Solr Newbie - need a point in the right direction
Hi, First time poster here - I'm not entirely sure where I need to look for this information. What I'm trying to do is extract some (presumably) structured information from non-uniform data (eg, prices from a nutch crawl) that needs to show in search queries, and I've come up against a wall. I've been unable to figure out where is the best place to begin. I had a look through the solr wiki and did a search via Lucid's search tool and I'm guessing this is handled at index time through my schema? But I've also seen dismax being thrown around as a possible solution and this has confused me. Basically, if you guys could point me in the right direction for resources (even as much as saying, you need X, it's over there) that would be a huge help. Cheers Mark
Re: Solr Newbie - need a point in the right direction
Thanks to everyone who responded, no wonder I was getting confused, I was completely focusing on the wrong half of the equation. I had a cursory look through some of the Nutch documentation available and it is looking promising. Thanks everyone. Mark On Tue, Dec 7, 2010 at 10:19 PM, webdev1977 wrote: > > I my experience, the hardest (but most flexible part) is exactly what was > mentioned.. processing the data. Nutch does have a really easy plugin > interface that you can use, and the example plugin is a great place to > start. Once you have the raw parsed text, you can do what ever you want > with it. For example, I wrote a plugin to add geospatial information to > my > NutchDocument. You then map the fields you added in the NutchDocument to > something you want to have Solr index. In my case I created a geography > field where I put lat, lon info. Then you create that same geography field > in the nutch to solr mapping file as well as your solr schema.xml file. > Then, when you run the crawl and tell it to use "solrindex" it will send > the > document to solr to be indexed. Since you have your new field in the > schema, it knows what to do with it at index time. Now you can build a > user > interface around what you want to do with that field. > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Solr-Newbie-need-a-point-in-the-right-direction-tp2031381p2033687.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Warming searchers/Caching
Is there any plugin or easy way to auto-warm/cache a new searcher with a bunch of searches read from a file? I know this can be accomplished using the EventListeners (newSearcher, firstSearcher) but I'd rather not add 100+ queries to my solrconfig.xml. If there is no hook/listener available, is there some sort of handler that performs this function? Thanks!
Re: Warming searchers/Caching
Maybe I should explain my problem in a little more detail. The problem we are experiencing is that after a delta-import we notice an extremely high load time on the slave machines that just replicated. It goes away after a minute or so of production traffic, once everything is cached. I already have a before/after hook that is in place before/after replication takes place. The before hook removes the slave from the cluster and then starts to replicate. When it's done it calls the after hook, and I would like to warm up the cache in this method so no users experience extremely long wait times. On 12/7/10 4:22 PM, Markus Jelsma wrote: XInclude works fine but that's not what you're looking for i guess. Having the 100 top queries is overkill anyway and it can take too long for a new searcher to warm up. Depending on the type of requests, i usually tend to limit warming to popular filter queries only as they generate a very high hit ratio and make caching useful [1]. If there are very popular user entered queries having a high initial latency, i'd have them warmed up as well. [1]: http://wiki.apache.org/solr/SolrCaching#Tradeoffs Warning: I haven't used this personally, but Xinclude looks like what you're after, see: http://wiki.apache.org/solr/SolrConfigXml#XInclude Best Erick On Tue, Dec 7, 2010 at 6:33 PM, Mark wrote: Is there any plugin or easy way to auto-warm/cache a new searcher with a bunch of searches read from a file? I know this can be accomplished using the EventListeners (newSearcher, firstSearcher) but I'd rather not add 100+ queries to my solrconfig.xml. If there is no hook/listener available, is there some sort of handler that performs this function? Thanks!
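A sketch of the XInclude route Erick and Markus mention, so the query list can live outside solrconfig.xml; the file name and the queries are made up, and it assumes the config XInclude support Solr has had since 1.4:

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <xi:include href="warming-queries.xml"
                  xmlns:xi="http://www.w3.org/2001/XInclude"/>
    </listener>

    <!-- warming-queries.xml, kept in the same conf directory -->
    <arr name="queries">
      <lst><str name="q">sony</str><str name="fq">category:electronics</str></lst>
      <lst><str name="q">samsung galaxy</str></lst>
    </arr>

The included file must be a well-formed XML document whose root is the <arr name="queries"> element, which the QuerySenderListener then reads as usual.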
Re: Warming searchers/Caching
I am using 1.4.1. What am I doing that Solr already provides? Thanks for your help On 12/8/10 5:10 AM, Erick Erickson wrote: What version of Solr are you using? Because it seems like you're doing a lot of stuff that Solr already does for you automatically. So perhaps a more complete statement of your setup is in order, since we seem to be talking past each other. Best Erick On Tue, Dec 7, 2010 at 10:24 PM, Mark wrote: Maybe I should explain my problem a little more in detail. The problem we are experiencing is after a delta-import we notice an extremely high load on the slave machines that just replicated. It goes away after a minute or so of production traffic once everything is cached. I already have a before/after hook that is in place before/after replication takes place. The before hook removes the slave from the cluster and then starts to replicate. When it's done it calls the after hook and I would like to warm up the cache in this method so no users experience extremely long wait times. On 12/7/10 4:22 PM, Markus Jelsma wrote: XInclude works fine but that's not what you're looking for, I guess. Having the 100 top queries is overkill anyway and it can take too long for a new searcher to warm up. Depending on the type of requests, I usually tend to limit warming to popular filter queries only as they generate a very high hit ratio and make caching useful [1]. If there are very popular user-entered queries having a high initial latency, I'd have them warmed up as well. [1]: http://wiki.apache.org/solr/SolrCaching#Tradeoffs Warning: I haven't used this personally, but XInclude looks like what you're after, see: http://wiki.apache.org/solr/SolrConfigXml#XInclude Best Erick On Tue, Dec 7, 2010 at 6:33 PM, Mark wrote: Is there any plugin or easy way to auto-warm/cache a new searcher with a bunch of searches read from a file? I know this can be accomplished using the EventListeners (newSearcher, firstSearcher) but I would rather not add 100+ queries to my solrconfig.xml. If there is no hook/listener available, is there some sort of Handler that performs this sort of function? Thanks!
Re: Warming searchers/Caching
We only replicate twice an hour so we are far from real-time indexing. Our application never writes to master rather we just pick up all changes using updated_at timestamps when delta-importing using DIH. We don't have any warming queries in firstSearcher/newSearcher event listeners. My initial post was asking how I would go about doing this with a large number of queries. Our queries themselves tend to have a lot of faceting and other restrictions on them so I would rather not list them all out using XML. I was hoping there was some sort of log replayer handler or class that would replay a bunch of queries while the node is offline. When it's done, it will bring the node back online ready to serve requests. On 12/8/10 6:15 AM, Jonathan Rochkind wrote: How often do you replicate? Do you know how long your warming queries take to complete? As others in this thread have mentioned, if your replications (or ordinary commits, if you weren't using replication) happen quicker than warming takes to complete, you can get overlapping indexes being warmed up, and run out of RAM (causing garbage collection to take lots of CPU, if not an out-of-memory error), or otherwise block on CPU with lots of new indexes being warmed at once. Solr is not very good at providing 'real time indexing' for this reason, although I believe there are some features in post-1.4 trunk meant to support 'near real time search' better. ____ From: Mark [static.void@gmail.com] Sent: Tuesday, December 07, 2010 10:24 PM To: solr-user@lucene.apache.org Subject: Re: Warming searchers/Caching Maybe I should explain my problem a little more in detail. The problem we are experiencing is after a delta-import we notice an extremely high load on the slave machines that just replicated. It goes away after a minute or so of production traffic once everything is cached. I already have a before/after hook that is in place before/after replication takes place. The before hook removes the slave from the cluster and then starts to replicate. When it's done it calls the after hook and I would like to warm up the cache in this method so no users experience extremely long wait times. On 12/7/10 4:22 PM, Markus Jelsma wrote: XInclude works fine but that's not what you're looking for, I guess. Having the 100 top queries is overkill anyway and it can take too long for a new searcher to warm up. Depending on the type of requests, I usually tend to limit warming to popular filter queries only as they generate a very high hit ratio and make caching useful [1]. If there are very popular user-entered queries having a high initial latency, I'd have them warmed up as well. [1]: http://wiki.apache.org/solr/SolrCaching#Tradeoffs Warning: I haven't used this personally, but XInclude looks like what you're after, see: http://wiki.apache.org/solr/SolrConfigXml#XInclude Best Erick On Tue, Dec 7, 2010 at 6:33 PM, Mark wrote: Is there any plugin or easy way to auto-warm/cache a new searcher with a bunch of searches read from a file? I know this can be accomplished using the EventListeners (newSearcher, firstSearcher) but I would rather not add 100+ queries to my solrconfig.xml. If there is no hook/listener available, is there some sort of Handler that performs this sort of function? Thanks!
Re: Warming searchers/Caching
I actually built in the before/after hooks so we can disable/enable a node from the cluster while it's replicating. When the machine was copying over 20 gigs and serving requests the load spiked tremendously. It was easy enough to create a sort of rolling replication... i.e., 1) node 1 removes health-check file, replicates, then goes back up 2) node 2 removes health-check file, replicates, then goes back up, ... Which listener gets called after replication... I'm guessing newSearcher? Thanks for your help On 12/8/10 10:18 AM, Erick Erickson wrote: Perhaps the tricky part here is that Solr makes its caches for #parts# of the query. In other words, a query that sorts on field A will populate the cache for field A. Any other query that sorts on field A will use the same cache. So you really need just enough queries to populate, in this case, the fields you'll sort by. One could put together multiple sorts on a single query and populate the sort caches all at once if you wanted. Similarly for faceting and filter queries. You might well be able to make just a few queries that filled up all the relevant caches rather than using 100s, but you know your schema way better than I do. What I meant about replicating work is that trying to use your after hook to fire off the queries probably doesn't buy you anything over firstSearcher/newSearcher lists. All that said, though, if you really don't want to put your queries in the config file, it would be relatively trivial to write a small Java app that uses SolrJ to query the server, reading the queries from anyplace you choose, and call it from the after hook. Personally, I think this is a high-cost option when compared to having the list in the config file due to the added complexity, but that's your call. Best Erick On Wed, Dec 8, 2010 at 12:25 PM, Mark wrote: We only replicate twice an hour so we are far from real-time indexing. Our application never writes to master rather we just pick up all changes using updated_at timestamps when delta-importing using DIH. We don't have any warming queries in firstSearcher/newSearcher event listeners. My initial post was asking how I would go about doing this with a large number of queries. Our queries themselves tend to have a lot of faceting and other restrictions on them so I would rather not list them all out using XML. I was hoping there was some sort of log replayer handler or class that would replay a bunch of queries while the node is offline. When it's done, it will bring the node back online ready to serve requests. On 12/8/10 6:15 AM, Jonathan Rochkind wrote: How often do you replicate? Do you know how long your warming queries take to complete? As others in this thread have mentioned, if your replications (or ordinary commits, if you weren't using replication) happen quicker than warming takes to complete, you can get overlapping indexes being warmed up, and run out of RAM (causing garbage collection to take lots of CPU, if not an out-of-memory error), or otherwise block on CPU with lots of new indexes being warmed at once. Solr is not very good at providing 'real time indexing' for this reason, although I believe there are some features in post-1.4 trunk meant to support 'near real time search' better. From: Mark [static.void@gmail.com] Sent: Tuesday, December 07, 2010 10:24 PM To: solr-user@lucene.apache.org Subject: Re: Warming searchers/Caching Maybe I should explain my problem a little more in detail. The problem we are experiencing is after a delta-import we notice an extremely high load on the slave machines that just replicated. It goes away after a minute or so of production traffic once everything is cached. I already have a before/after hook that is in place before/after replication takes place. The before hook removes the slave from the cluster and then starts to replicate. When it's done it calls the after hook and I would like to warm up the cache in this method so no users experience extremely long wait times. On 12/7/10 4:22 PM, Markus Jelsma wrote: XInclude works fine but that's not what you're looking for, I guess. Having the 100 top queries is overkill anyway and it can take too long for a new searcher to warm up. Depending on the type of requests, I usually tend to limit warming to popular filter queries only as they generate a very high hit ratio and make caching useful [1]. If there are very popular user-entered queries having a high initial latency, I'd have them warmed up as well. [1]: http://wiki.apache.org/solr/SolrCaching#Tradeoffs Warning: I haven't used this personally, but XInclude looks like what you're after, see: http://wiki.apache.org/solr/SolrConfigXml#XInclude Best Erick On Tue, Dec 7, 2010 at 6:33 PM, Mark wrote: Is there any plugin or easy way to auto-warm/cache a new searcher with a bunch of searches read from a file? I know this ca
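For completeness, a minimal sketch of the small SolrJ replayer Erick describes, written against the 1.4-era API (CommonsHttpSolrServer). The file format (one raw query string per line), argument handling and lack of error handling are assumptions for illustration only.

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CacheWarmer {
  // Usage from the after-replication hook, before re-adding the health-check file:
  //   java CacheWarmer http://localhost:8983/solr warm-queries.txt
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr = new CommonsHttpSolrServer(args[0]);
    BufferedReader reader = new BufferedReader(new FileReader(args[1]));
    String line;
    while ((line = reader.readLine()) != null) {
      line = line.trim();
      if (line.length() == 0 || line.startsWith("#")) {
        continue; // skip blank lines and comments
      }
      // Each line is a raw, unencoded parameter string, e.g.
      //   q=*:*&fq=in_stock:true&facet=true&facet.field=category
      SolrQuery query = new SolrQuery();
      for (String param : line.split("&")) {
        int eq = param.indexOf('=');
        if (eq > 0) {
          query.add(param.substring(0, eq), param.substring(eq + 1));
        }
      }
      solr.query(query); // response is discarded; the point is to fill the caches
    }
    reader.close();
  }
}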
Re: Warming searchers/Caching
Our machines have around 8gb of ram and our index is 25gb. What are some good values for those cache settings? Looks like we have the defaults in place... size="16384" initialSize="4096" autowarmCount="1024" You are correct, I am just removing the health-check file and our load balancer prevents any traffic from reaching those nodes while they are replicating. On 12/8/10 4:41 PM, Chris Hostetter wrote: : What am I doing that Solr already provides? the one thing I haven't seen mentioned anywhere in this thread is what you have the "autoWarmCount" value set to on all of the various solr internal caches (as seen in your solrconfig.xml) if that's set, you don't need to manually feed solr any special queries, it will warm them automatically when a newSearcher is opened. this assumes of course that the SolrCore has old caches to warm from -- i.e. if you use normal replication with an existing SolrCore. You've made references to taking slaves out of clusters using before/after hooks of your own creation -- as long as this is just stopping traffic from reaching the slave then auto-warming should work fine for you -- if you are actually shutting down the SolrCore and starting up a new one, then it won't -- and you are probably making extra work for yourself. -Hoss
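For reference, these are the solrconfig.xml cache declarations in question; the sizes below are placeholders rather than recommendations. autowarmCount is the number of entries Solr copies (for the filter and query result caches, by re-executing them) from the old searcher's cache into the new one when a searcher opens after replication:

<filterCache class="solr.FastLRUCache" size="4096" initialSize="1024" autowarmCount="512"/>
<queryResultCache class="solr.LRUCache" size="4096" initialSize="1024" autowarmCount="256"/>
<!-- documentCache is keyed on internal Lucene doc ids, so it is never autowarmed -->
<documentCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="0"/>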
Very high load after replicating
After replicating an index of around 20g my slaves experience very high load (50+!). Is there anything I can do to alleviate this problem? Would SolrCloud be of any help? Thanks
Re: Very high load after replicating
Markus, My configuration is as follows... ... false 2 ... false 64 10 false true No cache warming queries and our machines have 8g of memory in them with about 5120m of ram dedicated to Solr. When our index is around 10-11g in size everything runs smoothly. At around 20g+ it just falls apart. Can you (or anyone) provide some suggestions? Thanks On 12/12/10 1:11 PM, Markus Jelsma wrote: There can be numerous explanations such as your configuration (cache warming queries, merge factor, replication events etc) but also I/O having trouble flushing everything to disk. It could also be a memory problem, the OS might start swapping if you allocate too much RAM to the JVM leaving little for the OS to work with. You need to provide more details. After replicating an index of around 20g my slaves experience very high load (50+!!) Is there anything I can do to alleviate this problem? Would solr cloud be of any help? thanks
Re: Very high load
Changing the subject. It's not related to replication after all. It only appeared after indexing an extra field which increased our index size from 12g to 20g+ On 12/13/10 7:57 AM, Mark wrote: Markus, My configuration is as follows... ... false 2 ... false 64 10 false true No cache warming queries and our machines have 8g of memory in them with about 5120m of ram dedicated to Solr. When our index is around 10-11g in size everything runs smoothly. At around 20g+ it just falls apart. Can you (or anyone) provide some suggestions? Thanks On 12/12/10 1:11 PM, Markus Jelsma wrote: There can be numerous explanations such as your configuration (cache warming queries, merge factor, replication events etc) but also I/O having trouble flushing everything to disk. It could also be a memory problem, the OS might start swapping if you allocate too much RAM to the JVM leaving little for the OS to work with. You need to provide more details. After replicating an index of around 20g my slaves experience very high load (50+!!) Is there anything I can do to alleviate this problem? Would solr cloud be of any help? thanks
Need some guidance on solr-config settings
Can anyone offer some advice on what some good settings would be for an index of around 6 million documents totaling around 20-25gb? It seems like when our index gets to this size our CPU load spikes tremendously. What would be some appropriate settings for ramBufferSize and mergeFactor? We currently have: 10 64 Same question on cache settings. We currently have: false 2 Are there any other settings that I could tweak to affect performance? Thanks
Re: Need some guidance on solr-config settings
Excellent reply. You mentioned: "I've been experimenting with FastLRUCache versus LRUCache, because I read that below a certain hitratio, the latter is better." Do you happen to remember what that threshold is? Thanks On 12/14/10 7:59 AM, Shawn Heisey wrote: On 12/14/2010 8:31 AM, Mark wrote: Can anyone offer some advice on what some good settings would be for an index or around 6 million documents totaling around 20-25gb? It seems like when our index gets to this size our CPU load spikes tremendously. If you are adding, deleting, or updating documents on a regular basis, I would bet that it's your autoWarmCount. You've told it that whenever you do a commit, it needs to make up to 32768 queries against the new index. That's very intense and time-consuming. If you are also optimizing the index, the problem gets even worse. On the documentCache, autowarm doesn't happen, so the 16384 specified there isn't actually doing anything. Below are my settings. I originally had much larger caches with equally large autoWarmCounts ... reducing them to this level was the only way I could get my autowarm time below 30 seconds on each index. If you go to the admin page for your index and click on Statistics, then search for "warmupTime" you'll see how long it took to do the queries. Later on the page you'll also see this broken down on each cache. Since I made the changes, performance is actually better now, not worse. I have been experimenting with FastLRUCache versus LRUCache, because I read that below a certain hitratio, the latter is better. I've got 8 million documents in each shard, taking up about 15GB. My mergeFactor is 16 and my ramBufferSize is 256MB. These really only come into play when I do a full re-index, which is rare.
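For reference, the indexing-side values Shawn mentions map onto a Solr 1.4 solrconfig.xml roughly like this (his numbers, shown as a starting point to tune against the warmupTime statistics, not as a general recommendation):

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>16</mergeFactor>
  <ramBufferSizeMB>256</ramBufferSizeMB>
</indexDefaults>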
DIH and UTF-8
Seems like I am missing some configuration when trying to use DIH to import documents with Chinese characters. All the documents save crazy nonsense like "这是测试" instead of actual Chinese characters. I think it's at the JDBC level because if I hardcode one of the fields within data-config.xml (using a template transformer) the characters show up correctly. Any ideas? Thanks
Re: DIH and UTF-8
Solr: 1.4.1 JDBC driver: Connector/J 5.1.14 Looks like it's the JDBC driver because it doesn't even work with a simple Java program. I know this is a little off subject now, but do you have any clues? Thanks again On 12/27/10 1:58 PM, Erick Erickson wrote: More data please. Which jdbc driver? Have you tried just printing out the results of using that driver in a simple Java program? Solr should handle UTF-8 just fine, but the servlet container may have to have some settings tweaked, which one of those are you using? What version of Solr? Best Erick On Mon, Dec 27, 2010 at 3:05 PM, Mark wrote: Seems like I am missing some configuration when trying to use DIH to import documents with Chinese characters. All the documents save crazy nonsense like "这是测试" instead of actual Chinese characters. I think it's at the JDBC level because if I hardcode one of the fields within data-config.xml (using a template transformer) the characters show up correctly. Any ideas? Thanks
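The standalone check Erick suggests can be as small as the sketch below (connection URL, table and column are placeholders); writing the bytes out explicitly as UTF-8 avoids being fooled by the console's own encoding:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class Utf8Check {
  public static void main(String[] args) throws Exception {
    Class.forName("com.mysql.jdbc.Driver");
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/mydb?useUnicode=true&characterEncoding=UTF-8",
        "dbuser", "dbpass");
    Statement st = conn.createStatement();
    ResultSet rs = st.executeQuery("SELECT title FROM items LIMIT 5");
    while (rs.next()) {
      // Print raw UTF-8 bytes so a mis-set terminal encoding can't hide driver-level mangling
      System.out.write(rs.getString(1).getBytes("UTF-8"));
      System.out.println();
    }
    rs.close();
    st.close();
    conn.close();
  }
}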
Re: DIH and UTF-8
I tried both of those with no luck. On 12/27/10 2:49 PM, Glen Newton wrote: 1 - Verify your mysql is set up using UTF-8 2 - Does your JDBC connect string contain: useUnicode=true&characterEncoding=UTF-8 See: http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-charsets.html Glen http://zzzoot.blogspot.com/ On Mon, Dec 27, 2010 at 5:15 PM, Mark wrote: Solr: 1.4.1 JDBC driver: Connector/J 5.1.14 Looks like it's the JDBC driver because it doesn't even work with a simple Java program. I know this is a little off subject now, but do you have any clues? Thanks again On 12/27/10 1:58 PM, Erick Erickson wrote: More data please. Which jdbc driver? Have you tried just printing out the results of using that driver in a simple Java program? Solr should handle UTF-8 just fine, but the servlet container may have to have some settings tweaked, which one of those are you using? What version of Solr? Best Erick On Mon, Dec 27, 2010 at 3:05 PM, Mark wrote: Seems like I am missing some configuration when trying to use DIH to import documents with Chinese characters. All the documents save crazy nonsense like "这是测试" instead of actual Chinese characters. I think it's at the JDBC level because if I hardcode one of the fields within data-config.xml (using a template transformer) the characters show up correctly. Any ideas? Thanks
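For reference, in a DIH setup those connector flags go on the dataSource element in data-config.xml; host, database and credentials below are placeholders, and the ampersand has to be escaped inside the XML attribute:

<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb?useUnicode=true&amp;characterEncoding=UTF-8"
            user="dbuser" password="dbpass"/>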
Re: DIH and UTF-8
Just like the user of that thread... I have my database, table, columns and system variables all set but it still doesn't work as expected.
Server version: 5.0.67 Source distribution
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> SHOW VARIABLES LIKE 'collation%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)
mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | utf8                                   |
| character_set_connection | utf8                                   |
| character_set_database   | utf8                                   |
| character_set_filesystem | binary                                 |
| character_set_results    | utf8                                   |
| character_set_server     | utf8                                   |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.00 sec)
Any other ideas? Thanks
On 12/27/10 3:23 PM, Glen Newton wrote:
> [client]
> default-character-set = utf8
> [mysql]
> default-character-set=utf8
> [mysqld]
> character_set_server = utf8
> character_set_client = utf8
Re: DIH and UTF-8
It was due to the way I was writing to the DB using our Rails application. Everything looked correct but when retrieving it using the JDBC driver it was all mangled. On 12/27/10 4:38 PM, Glen Newton wrote: Is it possible your browser is not set up to properly display the Chinese characters? (I am assuming you are looking at things through your browser) Do you have any problems viewing other Chinese documents properly in your browser? Using mysql, can you see these characters properly? What happens when you use curl or wget to get a document from solr and look at it using something besides your browser? Yes, I am running out of ideas! :-) -Glen On Mon, Dec 27, 2010 at 7:22 PM, Mark wrote: Just like the user of that thread... I have my database, table, columns and system variables all set but it still doesn't work as expected. Server version: 5.0.67 Source distribution Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql> SHOW VARIABLES LIKE 'collation%'; +--+-+ | Variable_name| Value | +--+-+ | collation_connection | utf8_general_ci | | collation_database | utf8_general_ci | | collation_server | utf8_general_ci | +--+-+ 3 rows in set (0.00 sec) mysql> SHOW VARIABLES LIKE 'character_set%'; +--++ | Variable_name| Value | +--++ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results| utf8 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | /usr/local/mysql/share/mysql/charsets/ | +--++ 8 rows in set (0.00 sec) Any other ideas? Thanks On 12/27/10 3:23 PM, Glen Newton wrote: [client] default-character-set = utf8 [mysql] default-character-set=utf8 [mysqld] character_set_server = utf8 character_set_client = utf8
Dynamic column names using DIH
Is there a way to create dynamic column names using the values returned from the query? For example:
Re: DIH and UTF-8
Sure thing. In my database.yml I was missing the "encoding: utf8" option. If one were to add unicode characters within Rails (console, web form, etc) the characters would appear to be saved correctly... i.e. when trying to retrieve them back, everything looked perfect. The characters also appeared correctly using the mysql prompt. However when trying to index or retrieve those characters using JDBC/Solr the characters were mangled. After adding the above utf8 encoding option I was able to correctly save utf8 characters into the database and retrieve them using JDBC/Solr. However when using the mysql client all the characters would show up as all mangled or as ''. This was resolved by running the following query "set names utf8;". On 12/28/10 10:17 PM, Glen Newton wrote: Hi Mark, Could you offer a more technical explanation of the Rails problem, so that if others encounter a similar problem your efforts in finding the issue will be available to them? :-) Thanks, Glen PS. This has wandered somewhat off-topic for this list: apologies & thanks for the patience of this list... On Tue, Dec 28, 2010 at 4:15 PM, Mark wrote: It was due to the way I was writing to the DB using our Rails application. Everything looked correct but when retrieving it using the JDBC driver it was all mangled. On 12/27/10 4:38 PM, Glen Newton wrote: Is it possible your browser is not set up to properly display the Chinese characters? (I am assuming you are looking at things through your browser) Do you have any problems viewing other Chinese documents properly in your browser? Using mysql, can you see these characters properly? What happens when you use curl or wget to get a document from solr and look at it using something besides your browser? Yes, I am running out of ideas! :-) -Glen On Mon, Dec 27, 2010 at 7:22 PM, Mark wrote: Just like the user of that thread... I have my database, table, columns and system variables all set but it still doesn't work as expected. Server version: 5.0.67 Source distribution Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql> SHOW VARIABLES LIKE 'collation%'; +--+-+ | Variable_name| Value | +--+-+ | collation_connection | utf8_general_ci | | collation_database | utf8_general_ci | | collation_server | utf8_general_ci | +--+-+ 3 rows in set (0.00 sec) mysql> SHOW VARIABLES LIKE 'character_set%'; +--++ | Variable_name| Value | +--++ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results| utf8 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | /usr/local/mysql/share/mysql/charsets/ | +--++ 8 rows in set (0.00 sec) Any other ideas? Thanks On 12/27/10 3:23 PM, Glen Newton wrote: [client] default-character-set = utf8 [mysql] default-character-set=utf8 [mysqld] character_set_server = utf8 character_set_client = utf8
Query multiple cores
Is it possible to query across multiple cores and combine the results? If not available out-of-the-box could this be accomplished using some sort of custom request handler? Thanks for any suggestions.
Re: Query multiple cores
I own the book already, Smiley :) I'm somewhat familiar with this feature but I wouldn't be searching across multiple machines. I would like to search across two separate cores on the same machine. Is distributed search the same as SolrCloud? When would one choose one over the other? On 12/29/10 12:34 PM, Smiley, David W. wrote: I recommend looking for answers on the wiki (or my book) before asking basic questions on the list. Here you go: http://wiki.apache.org/solr/DistributedSearch ~ David Smiley Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/ On Dec 29, 2010, at 3:24 PM, Mark wrote: Is it possible to query across multiple cores and combine the results? If not available out-of-the-box could this be accomplished using some sort of custom request handler? Thanks for any suggestions.
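For what it's worth, the DistributedSearch approach on the wiki works the same when the "shards" are two cores on one box: a request like the following (core names and query are placeholders) can be sent to either core, and it merges results from both:

http://localhost:8983/solr/core0/select?q=ipod&shards=localhost:8983/solr/core0,localhost:8983/solr/core1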
Question on long delta import
When using DIH my delta imports appear to finish quickly... i.e. it says "Indexing completed. Added/Updated: 95491 documents. Deleted 11148 documents." in a relatively short amount of time (~30 mins). However the importMessage says "A command is still running..." for a really long time (~60 mins). What is happening during this phase and how could I speed this up? Thanks!
DIH MySQLNonTransientConnectionException
I have recently been receiving the following errors during my DIH importing. Has anyone ran into this issue before? Know how to resolve it? Thanks! Jan 1, 2011 4:51:06 PM org.apache.solr.handler.dataimport.JdbcDataSource closeConnection SEVERE: Ignoring Error when closing connection com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Communications link failure during rollback(). Transaction resolution unknown. at sun.reflect.GeneratedConstructorAccessor29.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:532) at com.mysql.jdbc.Util.handleNewInstance(Util.java:407) at com.mysql.jdbc.Util.getInstance(Util.java:382) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1013) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927) at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4751) at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4345) at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1564) at org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:399) at org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:390) at org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:174) at org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:165) at org.apache.solr.handler.dataimport.DataConfig.clearCaches(DataConfig.java:332) at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:360) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370) Jan 1, 2011 4:51:06 PM org.apache.solr.handler.dataimport.JdbcDataSource closeConnection SEVERE: Ignoring Error when closing connection java.sql.SQLException: Streaming result set com.mysql.jdbc.rowdatadyna...@71f18c82 is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries. 
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:934) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:931) at com.mysql.jdbc.MysqlIO.checkForOutstandingStreamingData(MysqlIO.java:2724) at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1895) at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2140) at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2620) at com.mysql.jdbc.ConnectionImpl.rollbackNoChecks(ConnectionImpl.java:4854) at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4737) at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4345) at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1564) at org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:399) at org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:390) at org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:174) at org.apache.solr.handler.dataimport.DataConfig.clearCaches(DataConfig.java:332) at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:360) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
DIH keeps felling during full-import
I'm receiving the following exception when trying to perform a full-import (~30 hours). Any idea on ways I could fix this? Is there an easy way to use DIH to break apart a full-import into multiple pieces? IE 3 mini-imports instead of 1 large import? Thanks. Feb 7, 2011 5:52:33 AM org.apache.solr.handler.dataimport.JdbcDataSource closeConnection SEVERE: Ignoring Error when closing connection com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Communications link failure during rollback(). Transaction resolution unknown. at sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:532) at com.mysql.jdbc.Util.handleNewInstance(Util.java:407) at com.mysql.jdbc.Util.getInstance(Util.java:382) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1013) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927) at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4751) at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4345) at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1564) at org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:399) at org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:390) at org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:174) at org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:165) at org.apache.solr.handler.dataimport.DataConfig.clearCaches(DataConfig.java:332) at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:360) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370) Feb 7, 2011 5:52:33 AM org.apache.solr.handler.dataimport.JdbcDataSource closeConnection SEVERE: Ignoring Error when closing connection java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@1a797305 is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries. 
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:934) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:931) at com.mysql.jdbc.MysqlIO.checkForOutstandingStreamingData(MysqlIO.java:2724) at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1895) at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2140) at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2620) at com.mysql.jdbc.ConnectionImpl.rollbackNoChecks(ConnectionImpl.java:4854) at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4737) at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4345) at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1564) at org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:399) at org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:390) at org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:174) at org.apache.solr.handler.dataimport.DataConfig.clearCaches(DataConfig.java:332) at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:360) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370) Feb 7, 2011 7:03:29 AM org.apache.solr.handler.dataimport.JdbcDataSource closeConnection SEVERE: Ignoring Error when closing connection com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Communications link failure during rollback(). Transaction resolution unknown. at sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:532) at com.mysql.jdbc.Util.handleNewInstance(Util.java:407) at com.mysql.jdbc.Util.getInstance(Util.java:382) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1013) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927) at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4751) at com.mysql.jdbc.ConnectionImpl.realCl
Re: DIH keeps failing during full-import
Typo in subject On 2/7/11 7:59 AM, Mark wrote: I'm receiving the following exception when trying to perform a full-import (~30 hours). Any idea on ways I could fix this? Is there an easy way to use DIH to break apart a full-import into multiple pieces? IE 3 mini-imports instead of 1 large import? Thanks. Feb 7, 2011 5:52:33 AM org.apache.solr.handler.dataimport.JdbcDataSource closeConnection SEVERE: Ignoring Error when closing connection com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Communications link failure during rollback(). Transaction resolution unknown. at sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:532) at com.mysql.jdbc.Util.handleNewInstance(Util.java:407) at com.mysql.jdbc.Util.getInstance(Util.java:382) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1013) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927) at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4751) at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4345) at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1564) at org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:399) at org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:390) at org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:174) at org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:165) at org.apache.solr.handler.dataimport.DataConfig.clearCaches(DataConfig.java:332) at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:360) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370) Feb 7, 2011 5:52:33 AM org.apache.solr.handler.dataimport.JdbcDataSource closeConnection SEVERE: Ignoring Error when closing connection java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@1a797305 is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries. 
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:934) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:931) at com.mysql.jdbc.MysqlIO.checkForOutstandingStreamingData(MysqlIO.java:2724) at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1895) at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2140) at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2620) at com.mysql.jdbc.ConnectionImpl.rollbackNoChecks(ConnectionImpl.java:4854) at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:4737) at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4345) at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1564) at org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:399) at org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:390) at org.apache.solr.handler.dataimport.DataConfig$Entity.clearCache(DataConfig.java:174) at org.apache.solr.handler.dataimport.DataConfig.clearCaches(DataConfig.java:332) at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:360) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370) Feb 7, 2011 7:03:29 AM org.apache.solr.handler.dataimport.JdbcDataSource closeConnection SEVERE: Ignoring Error when closing connection com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Communications link failure during rollback(). Transaction resolution unknown. at sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:532) at com.mysql.jdbc.Util.handleNewInstance(Util.java:407) at com.mysql.jdbc.Util.getInstance(Util.java:382) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1013) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927) at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl
Re: DIH keeps felling during full-import
Full import is around 6M documents, which when completed totals around 30GB in size. I'm guessing it could be a database connectivity problem because I also see these types of errors on delta-imports which could be anywhere from 20K to 300K records. On 2/7/11 8:15 AM, Gora Mohanty wrote: On Mon, Feb 7, 2011 at 9:29 PM, Mark wrote: I'm receiving the following exception when trying to perform a full-import (~30 hours). Any idea on ways I could fix this? Is there an easy way to use DIH to break apart a full-import into multiple pieces? IE 3 mini-imports instead of 1 large import? Thanks. Feb 7, 2011 5:52:33 AM org.apache.solr.handler.dataimport.JdbcDataSource closeConnection SEVERE: Ignoring Error when closing connection com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Communications link failure during rollback(). Transaction resolution unknown. [...] This looks like a network issue, or some other failure in communicating with the mysql database. Is that a possibility? Also, how many records are you importing, what is the data size, what is the quality of the network connection, etc.? One way to break up the number of records imported at a time is to shard your data at the database level, but the advisability of this option depends on whether there is a more fundamental issue. Regards, Gora
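One way to get the "3 mini-imports instead of 1 large import" effect without re-sharding the database itself is to reference request parameters in the entity query and run several full-imports back to back, passing a different range each time and using clean=false for the later chunks so earlier chunks aren't wiped. This is a sketch, not from the thread; table, column, handler path and id ranges are placeholders:

<!-- data-config.xml -->
<entity name="item"
        query="SELECT * FROM items
               WHERE id &gt;= ${dataimporter.request.minId}
                 AND id &lt; ${dataimporter.request.maxId}">
  ...
</entity>

http://localhost:8983/solr/dataimport?command=full-import&clean=true&minId=0&maxId=2000000
http://localhost:8983/solr/dataimport?command=full-import&clean=false&minId=2000000&maxId=4000000
http://localhost:8983/solr/dataimport?command=full-import&clean=false&minId=4000000&maxId=6000000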
DIH threads
Has anyone applied the DIH threads patch on 1.4.1 (https://issues.apache.org/jira/browse/SOLR-1352)? Does anyone know if this works and/or does it improve performance? Thanks
Removing duplicates
I know that I can use the SignatureUpdateProcessorFactory to remove duplicates, but I would like to keep the duplicates in the index and remove them conditionally at query time. Is there an easy way I could accomplish this?
Field Collapsing on 1.4.1
Is there a seamless field collapsing patch for 1.4.1? I see it has been merged into trunk and I tried downloading it to give it a whirl, but it appears that many things have changed and our application would need some considerable work to get it up and running. Thanks