Re: IndexMergeTool to adhere to TieredMergePolicyFactory settings
Hi, Is it possible for the merging not to merge to one large segment? Regards, Edwin On 28 November 2017 at 12:05, Zheng Lin Edwin Yeo wrote: > Hi, > > I'm currently using Solr 6.5.1. > > I found that in IndexMergeTool.java there is this line > which sets the maxNumSegments to 1. > > writer.forceMerge(1); > > > Does it mean that there will always be only 1 segment after the > merging? From what I see, that seems to be the case. > > Is there any way we can allow the merging to produce multiple segments, > with each segment of a certain size? Like if we want each segment to be > 20GB, which is what is set in the TieredMergePolicyFactory? > > > <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> > <int name="maxMergeAtOnce">10</int> > <int name="segmentsPerTier">10</int> > <double name="maxMergedSegmentMB">20480</double> > </mergePolicyFactory> > > > Regards, > Edwin >
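For reference, a minimal sketch of a merge-tool variant that takes the target segment count as an argument instead of hardcoding forceMerge(1). It assumes Lucene 6.x APIs; the class name and argument layout are illustrative, not part of Lucene. Note that forceMerge(N) ignores maxMergedSegmentMB, so even with N > 1 the 20GB cap from the TieredMergePolicyFactory is not honored during a forced merge; to respect the cap, skip the forceMerge call entirely and let the configured merge policy decide the final segment layout.

    import java.nio.file.Paths;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Usage: MergeToNSegments <maxSegments> <mergedIndexDir> <indexDir1> [indexDir2 ...]
    public class MergeToNSegments {
      public static void main(String[] args) throws Exception {
        int maxSegments = Integer.parseInt(args[0]);    // e.g. 5 instead of the hardcoded 1
        Directory merged = FSDirectory.open(Paths.get(args[1]));
        IndexWriter writer = new IndexWriter(merged,
            new IndexWriterConfig(null).setOpenMode(OpenMode.CREATE));
        Directory[] sources = new Directory[args.length - 2];
        for (int i = 2; i < args.length; i++) {
          sources[i - 2] = FSDirectory.open(Paths.get(args[i]));
        }
        writer.addIndexes(sources);          // copy all source indexes in
        writer.forceMerge(maxSegments);      // IndexMergeTool calls forceMerge(1) here
        writer.close();
      }
    }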
Solr score use cases
Hi, A simple question: what are the most common use cases for the solr score of documents retrieved after firing queries? I don't have a real understanding of its purpose at the moment. Thx for helping
Range facet over currency field
Hi, I have a doubt regarding how to do a range facet in a different currency. I have indexed the price data in USD in the field price_usd_c, and I have a currency.xml which is generated by a process. If I want to do a range facet on the field price_usd_c in Euro currency, how can I do it, and what is the syntax for it? Is there any way to do so? If so, kindly help. Regards, Aman
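One option that may work is to express the EUR buckets as currency range queries inside facet.query; currency fields accept range queries with an explicit currency code, converted through the rates in currency.xml. A sketch with illustrative bucket boundaries (the parameters would be URL-encoded in a real request):

    q=*:*&facet=true
    &facet.query=price_usd_c:[0,EUR TO 100,EUR]
    &facet.query=price_usd_c:[100,EUR TO 200,EUR]
    &facet.query=price_usd_c:[200,EUR TO 500,EUR]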
Passing Metadata from an RTF-file via TIKA to SOLR ...
Hi there! I am quite new to Lucene/Solr/Tika, etc., so I would appreciate your help concerning the following matter. I have an RTF document that I want to index in Solr, using Tika. The RTF indexing works in general, but since I changed the Solr schema, the indexer complains about missing mandatory fields, like "module-id". The RTF file is generated by me, and I added the metadata fields to the RTF document in the "userprops" section of the RTF file (see below) -- so Tika should be able to read it and to provide it. The problem is: I don't know HOW or WHERE Tika provides this metadata, so I don't know how to access it. As a result, I don't know how I can map it to the respective Solr fields, like "module-id", that are mandatory in my Solr schema. Can someone give me a hint, please? I am running out of ideas here ... :-/ {\rtf1\fbidis\ansi\ansicpg1252\deff0\deflang1031{\fonttbl{\f0\fnil\fcharset0 Arial;}} {\colortbl ;\red0\green0\blue0;} {\userprops {\propname module-id}\proptype30{\staticval 000ba8a6} } } Mit freundlichen Grüßen/ With kind regards Jan Schluchtmann Systems Engineering Cluster Instruments VW Group Continental Automotive GmbH Division Interior ID S3 RM VDO-Strasse 1, 64832 Babenhausen, Germany Telefon/Phone: +49 6073 12-4346 Telefax: +49 6073 12-79-4346
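A quick way to find out which metadata keys Tika actually emits for the file is to run Tika standalone and dump everything it extracts. A minimal sketch, assuming Tika is on the classpath; whether the RTF parser surfaces the "userprops" section at all depends on your Tika version:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class DumpTikaMetadata {
      public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream in = new FileInputStream(args[0])) {
          // -1 disables the default write limit on extracted text
          parser.parse(in, new BodyContentHandler(-1), metadata);
        }
        for (String name : metadata.names()) {    // every key Tika extracted
          System.out.println(name + " = " + metadata.get(name));
        }
      }
    }

Whatever names show up in the output are the names to map on the Solr side; with the ExtractingRequestHandler, unmapped Tika fields can be caught with uprefix=attr_ and renamed with fmap.<tikaname>=<solrfield>.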
Re: Solr score use cases
Hi Faraz, The Solr score, which you can retrieve by adding it to the fl parameter, can be helpful for understanding the following: 1) search relevance ranking: how much score solr has given to the top & second top document, and with debug=true you can better understand what is causing that score. 2) You can use a function query to multiply the score with some feature, e.g. paid-customer score, popularity score, etc., to improve the relevance as per the business. These are the points I can think of; someone else may shed more light if I am missing anything. I hope this is what you want to know. 😊 Regards, Aman On Dec 1, 2017 13:38, "Faraz Fallahi" wrote: Hi, A simple question: what are the most common use cases for the solr score of documents retrieved after firing queries? I don't have a real understanding of its purpose at the moment. Thx for helping
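As an illustration of point 2, edismax supports a multiplicative boost parameter. A sketch, where "popularity" is a hypothetical numeric field and fl returns the resulting score:

    q=ipad&defType=edismax&qf=title&boost=popularity&fl=id,title,score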
Re: Solr score use cases
Oki, but if I just make a simple query with a "where clause" and sort by a field, I see no sense in calculating a score, right? On 01.12.2017 at 16:33, "Aman Tandon" wrote: > Hi Faraz, > > The Solr score, which you can retrieve by adding it to the fl parameter, can be > helpful for understanding the following: > > 1) search relevance ranking: how much score solr has given to the top & > second top document, and with debug=true you can better understand what > is causing that score. > > 2) You can use a function query to multiply the score with some feature, > e.g. paid-customer score, popularity score, etc., to improve the relevance > as per the business. > > These are the points I can think of; someone else may shed more light > if I am missing anything. I hope this is what you want to know. 😊 > > Regards, > Aman > > On Dec 1, 2017 13:38, "Faraz Fallahi" > wrote: > > Hi > > A simple question: what are the most common use cases for the solr score of > documents retrieved after firing queries? > I don't have a real understanding of its purpose at the moment. > > Thx for helping >
Re: Solr score use cases
Or does the score even get calculated when I sort, or not? On 01.12.2017 at 4:38 PM, "Faraz Fallahi" < faraz.fall...@googlemail.com> wrote: > Oki, but if I just make a simple query with a "where clause" and sort by > a field, I see no sense in calculating a score, right? > > On 01.12.2017 at 16:33, "Aman Tandon" wrote: > >> Hi Faraz, >> >> The Solr score, which you can retrieve by adding it to the fl parameter, can be >> helpful for understanding the following: >> >> 1) search relevance ranking: how much score solr has given to the top & >> second top document, and with debug=true you can better understand what >> is causing that score. >> >> 2) You can use a function query to multiply the score with some feature, >> e.g. paid-customer score, popularity score, etc., to improve the relevance >> as per the business. >> >> These are the points I can think of; someone else may shed more light >> if I am missing anything. I hope this is what you want to know. 😊 >> >> Regards, >> Aman >> >> On Dec 1, 2017 13:38, "Faraz Fallahi" >> wrote: >> >> Hi >> >> A simple question: what are the most common use cases for the solr score >> of >> documents retrieved after firing queries? >> I don't have a real understanding of its purpose at the moment. >> >> Thx for helping >> >
JVM GC Issue
Hi, We are encountering an issue with GC. Randomly, nearly once a day, there are consecutive full GCs with no memory reclaimed, so the old-generation heap usage grows up to the limit. Solr stops responding and we need to force a restart. We are using Solr 6.6.1 with an Oracle 1.8 JVM. The JVM settings are the defaults with the CMS GC. Solr is configured as standalone in master/slave replication. Replication is polled every 10 minutes. The hardware has 4 vCPUs and 16GB RAM. Both Xms and Xmx are set to 6GB. The core contains 1,200,000 documents, with a lot of fields both stored and indexed, without docValues enabled. The core size on the file system is 11GB. All Solr caches are sized to 1024 with autowarmCount set to 256. The query rate is low (10 per second). The only memory-consuming features are sort and facet. There are 10 facets with 10 to 2,000 distinct values each. I don't see errors or warnings in the Solr logs about opening or warming new searchers. Here is a link to the GCeasy report: http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTIvMS8tLXNvbHJfZ2MubG9nLjAuemlwLS0xMS0xNC0zMA== GCeasy suggests increasing the heap size, but I do not agree. Any suggestions about this issue? Regards. Dominique -- Dominique Béjean 06 08 46 12 43
Re: Solr LTR plugin - Training
Also, when I want to add phrase match as a feature as below, it is not accepted: { "store" : "tsf", "name" : "productTitleMatchGuestQuery", "class" : "org.apache.solr.ltr.feature.SolrFeature", "params" : { "q" : "{!dismax qf=p_title^1.0}${user_query}" } }, { "store" : "tsf", "name" : "productTitlePfMatchGuestQuery", "class" : "org.apache.solr.ltr.feature.SolrFeature", "params" : { "q" : "{!dismax pf=p_title^2.0}${user_query}" } }, Not sure how to add phrase matching here. I also want to add pf2, pf3 as features if this works. Thanks, --Ilay
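Hard to say without the exact error, but two things stand out: a dismax query with only pf and no qf has no main clause to match against, and pf2/pf3 are edismax parameters, not dismax. An untested sketch of a combined phrase feature (names and boosts are illustrative):

    {
      "store" : "tsf",
      "name"  : "productTitlePhraseMatchGuestQuery",
      "class" : "org.apache.solr.ltr.feature.SolrFeature",
      "params" : {
        "q" : "{!edismax qf=p_title pf=p_title^2.0 pf2=p_title^1.5 pf3=p_title^1.5}${user_query}"
      }
    }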
Re:LTR model upload
Hi Brian, Thank you for this question! It sounds like you may be experiencing the issue reported in https://issues.apache.org/jira/browse/SOLR-11049 and perhaps the same workaround could work for you? And in the upcoming Solr 7.2 release https://issues.apache.org/jira/browse/SOLR-11250 will provide an alternative way for handling large models. Hope that helps. Regards, Christine - Original Message - From: solr-user@lucene.apache.org To: solr-user@lucene.apache.org At: 11/28/17 23:00:11 When I upload, I can see my model when I hit "solr/collection_name/schema/model-store/" but when I reload the collection, it's gone. Is there a size limit for LTR models? I have a 1.6mb / 49,000 line long lambdamart model (that's what ranklib spit out for me) which I didn't think would be a huge problem. I decided to test by cutting down the model size by deleting 99% of it and it worked after reload. Does this mean my model is too big or do I possibly have a syntax bug somewhere in my model.json? * Brian
Re: JVM GC Issue
Dominique Bejean wrote: > We are encountering an issue with GC. > Randomly, nearly once a day, there are consecutive full GCs with no memory > reclaimed. [... 1.2M docs, Xmx 6GB ...] > GCeasy suggests increasing the heap size, but I do not agree It does seem strange, with your apparently modest index & workload. Nothing you say sounds problematic to me and you have covered the usual culprits: overlapping searchers, faceting and filterCache. Is it possible for you to share the solr.log around the two times that memory usage peaked? 2017-11-30 17:00-19:00 and 2017-12-01 08:00-12:00. If you cannot share, please check if you have excessive traffic around that time or if there is a lot of UnInverting going on (triggered by faceting on non-DocValues String fields). I know your post implies that you have already done so, so this is more of a sanity check. - Toke Eskildsen
Request return behavior when trigger DataImportHandler updates
Hello, When triggering the DataImportHandler to update via an HTTP request, I've noticed the handler behaves differently depending on how I specify the request parameters. If I use query parameters, e.g. http://localhost:8983/solr/mycore/dataimport?command=full-import, the HTTP request is not fulfilled until the import is completed. However, if I instead make a request to http://localhost:8983/solr/mycore/dataimport with a form-data key of "command"="full-import", the HTTP request returns immediately. In this case, the documentation seems to suggest the best way to detect when the import has completed is to poll Solr for status updates until the status switches from "busy" back to "idle". Is this difference in behavior intentional? I would like to be able to rely on the first behavior - to expect that the import will be finished when the HTTP call returns. However, I can't seem to find any documentation on this difference; I'd prefer not to have my component rely on what might be a convenient bug :). I'm using Solr 6.5.1. Thanks! Nathan
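For the polling approach mentioned above, the status command is the way to check progress. A sketch against the core named in the message:

    curl "http://localhost:8983/solr/mycore/dataimport?command=status"

The response contains a "status" field that reads "busy" while an import is running and "idle" once it has finished.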
Re: Fwd: solr-security-proxy
The default blacklist is qt and stream, because there are examples of nasty things which can be done using those parms. But it seems much wiser to whitelist just the parms your web app needs to use. Am I missing something? Is there a simpler way to protect a Solr installation which just serves a few AJAX GETs? Cheers -- Rick On November 30, 2017 3:10:14 PM EST, Rick Leir wrote: >Hi all >I have just been looking at solr-security-proxy, which seems to be a >great little app to put in front of Solr (link below). But would it >make more sense to use a whitelist of Solr parameters instead of a >blacklist? >Thanks >Rick > >https://github.com/dergachev/solr-security-proxy > >solr-security-proxy >Node.js based reverse proxy to make a solr instance read-only, >rejecting requests that have the potential to modify the solr index. >--invalidParams Block these query params (comma separated) [default: >"qt,stream"] > > >-- >Sorry for being brief. Alternate email is rickleir at yahoo dot com -- Sorry for being brief. Alternate email is rickleir at yahoo dot com
starting SolrCloud nodes
Thanks to previous help. I have a ZK ensemble of three nodes running. I have uploaded the config for my collection and the solr.xml file. I have Solr installed on three machines. I think my next steps are: Start up each Solr instance: bin/solr start -c -z "zoo1:2181,zoo2:2181,zoo3:2181" // I have ZK_HOST set to my ZKs, but the documentation seems to say I need to provide the list here to prevent the embedded ZK from getting used. From one of the Solr instances create a collection: bin/solr create -c nameofuploadedconfig -s 3 -rf 2 //for example. The documentation I think implies that all of the Solr instances are automatically set up with the collection. There is no further action needed at this point? Thanks. -S
Re: JVM GC Issue
Your autowarm counts are rather high, but as Toke says this doesn't seem outrageous. I have seen situations where Solr is running close to the limits of its heap and GC only reclaims a tiny bit of memory each time, when you say "full GC with no memory reclaimed" is that really no memory _at all_? Or "almost no memory"? This situation can be alleviated by allocating more memory to the JVM. Your JVM pressure would certainly be reduced by enabling docValues on any field you sort, facet or group on. That would require a full reindex of course. Note that this makes your index on disk bigger, but reduces JVM pressure by roughly the same amount so it's a win in this situation. Have you attached a memory profiler to the running Solr instance? I'd be curious where the memory is being allocated. Best, Erick On Fri, Dec 1, 2017 at 8:31 AM, Toke Eskildsen wrote: > Dominique Bejean wrote: >> We are encountering an issue with GC. > >> Randomly, nearly once a day, there are consecutive full GCs with no memory >> reclaimed. > > [... 1.2M docs, Xmx 6GB ...] > >> GCeasy suggests increasing the heap size, but I do not agree > > It does seem strange, with your apparently modest index & workload. Nothing > you say sounds problematic to me and you have covered the usual culprits: > overlapping searchers, faceting and filterCache. > > Is it possible for you to share the solr.log around the two times that memory > usage peaked? 2017-11-30 17:00-19:00 and 2017-12-01 08:00-12:00. > > If you cannot share, please check if you have excessive traffic around that > time or if there is a lot of UnInverting going on (triggered by faceting on > non-DocValues String fields). I know your post implies that you have already > done so, so this is more of a sanity check. > > > - Toke Eskildsen
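For reference, enabling docValues is a per-field change in the schema, followed by a full reindex. A sketch with hypothetical field names:

    <field name="category" type="string" indexed="true" stored="true" docValues="true"/>
    <field name="price"    type="tfloat" indexed="true" stored="true" docValues="true"/>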
Re: starting SolrCloud nodes
Looks good. If you set ZK_HOST in your solr.in.sh you can forgo setting it when you start, but that's not necessary. Best, Erick On Fri, Dec 1, 2017 at 9:13 AM, Steve Pruitt wrote: > Thanks to previous help. I have a ZK ensemble of three nodes running. I > have uploaded the config for my collection and the solr.xml file. > I have Solr installed on three machines. > > I think my next steps are: > > Start up each Solr instance: bin/solr start -c -z > "zoo1:2181,zoo2:2181,zoo3:2181" // I have ZK_HOST set to my ZKs, but the > documentation seems to say I need to provide the list here to prevent the > embedded ZK from getting used. > > From one of the Solr instances create a collection: bin/solr create -c > nameofuploadedconfig -s 3 -rf 2 //for example. > > The documentation I think implies that all of the Solr instances are > automatically set up with the collection. There is no further action needed at this > point? > > Thanks. > > -S
Re: [EXTERNAL] - Re: Basic SolrCloud help
On 11/30/2017 8:53 AM, Steve Pruitt wrote: > I took the hint and looked at both solr.in.cmd and solr.in.sh. Clearly > setting ZK_HOST is a first step. I am sure this is explained somewhere, but > I overlooked it. > From here, once I have Solr installed, I can run the Control Script to upload > a config set either when creating a collection or independent of creating the > collection. > > When I install Solr on the three nodes I have planned, I run the Control > Script and just point to somewhere on disk I have the config set stored. > > One question buried in my first missive was about mixing the Solr machines. > I was thinking of installing Solr on two VM's running Centos and then make my > third Solr node on my local machine, i.e. Windows. I can't think of why this > could be an issue, as long as everything is setup with the right ZK hosts, > etc. Does anyone know of any potential issues doing this? There shouldn't be any issue with different nodes running on different operating systems. There is no service install script for Windows, so you would have to do that yourself (usually people use NSSM or SRVANY) or you'd just have to start it manually, which I think means the include script (solr.in.cmd) will be in the bin directory. If the local machine you're describing is your personal machine, that is not something I would do. Personal machines typically need to be taken down and/or rebooted for various reasons on a frequent basis. Also, if the system resources on the local machine are significantly different than the VMs, SolrCloud's load balancing can't take that into account and send a different number of requests to each machine. Things work best when all the machines are substantially similar to each other. Also, since it's likely a personal Windows machine will be running a client OS like Windows 7 or Windows 10, I have heard from multiple sources that client operating systems from Microsoft are intentionally crippled in ways that the server operating systems (which cost a LOT more) aren't, so that they don't run server loads very well. I've never been able to find concrete information about exactly how Microsoft limits the functionality in the client OS. If anybody on this list has a source for concrete information, I'd love to see it. Last, I will mention my personal feelings about Windows compared to Linux. Performance seems to be generally better in Linux. NTFS is not as good as the native filesystems in Linux, although Lucene tends towards large files, which aren't as problematic as tons of little files. Memory management appears to me to be much better in Linux. The single biggest complaint I have about Windows is cost, especially for the server operating systems, which I think are the only good choice for a program like Solr. > I am not sure what "definition" and *config* are referencing. When I > initially install Solr it will not have a config set. I haven't created a > Collection yet. The running Solr instance upon initial install has no > config yet. But, I think I am not understanding what "definition" and > "*config*" mean. When I said "definition" I was talking about ZK_HOST or the argument to the -z option. The "config" would be the include script (usually solr.in.sh or solr.in.cmd) -- providing configuration options for Solr startup like heap size, zk info, gc tuning, etc. Thanks, Shawn
Re: Solr score use cases
Sorting certainly ignores scoring, I'm pretty sure it's just not calculated in that case. If your sorting results in multiple documents in the same bin, people will combine the primary sort with a secondary sort on score, so in that case the score is definitely calculated, i.e. "&sort=day asc, score desc" Returning the score with documents is usually for development purposes. Scores are _not_ comparable except within a single query, so IMO telling users that a doc from one search has a score of X and a doc from another search has a score of Y is useless-to-misleading information. A score of 2X is _not_ necessarily "twice as good" (or even as good) as a score of X in another search. FWIW, Erick On Fri, Dec 1, 2017 at 6:31 AM, Faraz Fallahi wrote: > Or does the score even get calculated when I sort, or not? > > On 01.12.2017 at 4:38 PM, "Faraz Fallahi" < > faraz.fall...@googlemail.com> wrote: > >> Oki, but if I just make a simple query with a "where clause" and sort by >> a field, I see no sense in calculating a score, right? >> >> On 01.12.2017 at 16:33, "Aman Tandon" wrote: >> >>> Hi Faraz, >>> >>> The Solr score, which you can retrieve by adding it to the fl parameter, can be >>> helpful for understanding the following: >>> >>> 1) search relevance ranking: how much score solr has given to the top & >>> second top document, and with debug=true you can better understand what >>> is causing that score. >>> >>> 2) You can use a function query to multiply the score with some feature, >>> e.g. paid-customer score, popularity score, etc., to improve the relevance >>> as per the business. >>> >>> These are the points I can think of; someone else may shed more light >>> if I am missing anything. I hope this is what you want to know. 😊 >>> >>> Regards, >>> Aman >>> >>> On Dec 1, 2017 13:38, "Faraz Fallahi" >>> wrote: >>> >>> Hi >>> >>> A simple question: what are the most common use cases for the solr score >>> of >>> documents retrieved after firing queries? >>> I don't have a real understanding of its purpose at the moment. >>> >>> Thx for helping >>> >>
Reg:- Indexing MySQL data with Solr
Hi, I am working on an e-commerce database. I have more than 40 tables and around 20GB of data. I want to index the data with Solr for a more effective search feature. Please tell me how to index MySQL data with Apache Solr. Thanks in advance Nandan Priyadarshi
Re: Request return behavior when trigger DataImportHandler updates
On 12/1/2017 9:37 AM, Nathan Friend wrote: > When triggering the DataImportHandler to update via an HTTP request, I've > noticed the handler behaves differently depending on how I specify the > request parameters. > > If I use query parameters, e.g. > http://localhost:8983/solr/mycore/dataimport?command=full-import, the HTTP > request is not fulfilled until the import is completed. > > However, if I instead make a request to > http://localhost:8983/solr/mycore/dataimport with a form-data key of > "command"="full-import", the HTTP request returns immediately. In this case, > the documentation seems to suggest the best way to detect when the import has > completed is to poll Solr for status updates until the status switches from > "busy" back to "idle". At one time, way back in version 1.4 and 3.2, I was making such requests from a Perl program, and they were definitely GET requests. These requests did NOT wait until the import completed. I just checked my most recent program, which uses SolrJ, and I discovered that I am using Method.POST there, so I can't confirm or deny what you're seeing yet. At one point, I went through all my SolrJ code and made sure all requests were using POST, so that I would never run into the limitations on the URL size. It wasn't really necessary for my code that starts DIH, but I was making the change everywhere. > Is this difference in behavior intentional? I would like to be able to rely > on the first behavior - to expect that the import will be finished when the > HTTP call returns. However, I can't seem to find any documentation on this > difference; I'd prefer not to have my component rely on what might be a > convenient bug :). As far as I know, Solr should be unaware of any difference between these ways of supplying the parameters, because the servlet container (usually Jetty) should just make parameters available to Solr, no matter how it received them. I think that the dataimport handler has always been async in nature. I've NEVER seen an import command wait to respond until the import is complete. My imports take many hours to run. If that's what you're seeing happen with a GET request, I'd call that a bug. We need to confirm the problem in the most recent versions before filing an issue. Thanks, Shawn
Re: Reg:- Indexing MySQL data with Solr
You can use the data import handler with a cache; it's faster. Check the documentation: https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html On Sat, Dec 2, 2017 at 12:21 AM, @Nandan@ wrote: > Hi, > I am working on an e-commerce database. I have more than 40 tables and > around 20GB of data. > I want to index the data with Solr for a more effective search feature. > Please tell me how to index MySQL data with Apache Solr. > Thanks in advance > Nandan Priyadarshi > -- Thanks, Sujay P Bawaskar M:+91-77091 53669
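A minimal sketch of a DIH configuration for this kind of setup (table, column and connection details are hypothetical); the child entity uses a DIH cache so its sub-query is not re-executed for every parent row:

    <dataConfig>
      <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost:3306/shop" user="solr" password="secret"/>
      <document>
        <entity name="product" query="SELECT id, name, price FROM product">
          <entity name="category"
                  query="SELECT product_id, name AS category FROM category"
                  cacheKey="product_id" cacheLookup="product.id"
                  cacheImpl="SortedMapBackedCache"/>
        </entity>
      </document>
    </dataConfig>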
RE: [Marketing Mail] Re: Request return behavior when trigger DataImportHandler updates
Thanks for the reply, Shawn. One additional detail I should have mentioned - in both scenarios, I'm making a POST request, not a GET. Even when using a POST, Solr seems to be able to accept URL query parameters. I'm making test requests using Postman (https://www.getpostman.com/), so it's also possible this issue is related to how Postman is sending the request (although I doubt it). Either way, it sounds like relying on the synchronous behavior of the HTTP request is not a good option. I've been looking into the "onImportEnd" event listener functionality and think this feature will provide a more robust way of generating notifications when imports have finished. Thanks! Nathan -Original Message- From: Shawn Heisey [mailto:apa...@elyograg.org] Sent: December 1, 2017 3:04 PM To: solr-user@lucene.apache.org Subject: [Marketing Mail] Re: Request return behavior when trigger DataImportHandler updates On 12/1/2017 9:37 AM, Nathan Friend wrote: > When triggering the DataImportHandler to update via an HTTP request, I've > noticed the handler behaves differently depending on how I specify the > request parameters. > > If I use query parameters, e.g. > http://localhost:8983/solr/mycore/dataimport?command=full-import, the HTTP > request is not fulfilled until the import is completed. > > However, if I instead make a request to > http://localhost:8983/solr/mycore/dataimport with a form-data key of > "command"="full-import", the HTTP request returns immediately. In this case, > the documentation seems to suggest the best way to detect when the import has > completed is to poll Solr for status updates until the status switches from > "busy" back to "idle". At one time, way back in version 1.4 and 3.2, I was making such requests from a Perl program, and they were definitely GET requests. These requests did NOT wait until the import completed. I just checked my most recent program, which uses SolrJ, and I discovered that I am using Method.POST there, so I can't confirm or deny what you're seeing yet. At one point, I went through all my SolrJ code and made sure all requests were using POST, so that I would never run into the limitations on the URL size. It wasn't really necessary for my code that starts DIH, but I was making the change everywhere. > Is this difference in behavior intentional? I would like to be able to rely > on the first behavior - to expect that the import will be finished when the > HTTP call returns. However, I can't seem to find any documentation on this > difference; I'd prefer not to have my component rely on what might be a > convenient bug :). As far as I know, Solr should be unaware of any difference between these ways of supplying the parameters, because the servlet container (usually Jetty) should just make parameters available to Solr, no matter how it received them. I think that the dataimport handler has always been async in nature. I've NEVER seen an import command wait to respond until the import is complete. My imports take many hours to run. If that's what you're seeing happen with a GET request, I'd call that a bug. We need to confirm the problem in the most recent versions before filing an issue. Thanks, Shawn
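For the "onImportEnd" route mentioned above, DIH lets you register a listener on the document element of the DIH configuration; the listener class below is a hypothetical sketch:

    <document onImportEnd="com.example.dih.NotifyOnImportEnd">
      ...
    </document>

    package com.example.dih;

    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.EventListener;

    public class NotifyOnImportEnd implements EventListener {
      @Override
      public void onEvent(Context ctx) {
        // invoked when the import finishes; notify your component here
      }
    }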
Java 9 and Solr 6.6
Hi all. Would you recommend installing Solr 6.6.1 with Java 9 for a production environment? Thanks, Sergio
Re: Reg:- Indexing MySQL data with Solr
Or, depending on your environment and comfort with Java, use SolrJ. Here's an example: https://lucidworks.com/2012/02/14/indexing-with-solrj/ Best, Erick On Fri, Dec 1, 2017 at 11:22 AM, Sujay Bawaskar wrote: > You can use the data import handler with a cache; it's faster. > > Check the documentation: > https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html > > On Sat, Dec 2, 2017 at 12:21 AM, @Nandan@ > wrote: > >> Hi, >> I am working on an e-commerce database. I have more than 40 tables and >> around 20GB of data. >> I want to index the data with Solr for a more effective search feature. >> Please tell me how to index MySQL data with Apache Solr. >> Thanks in advance >> Nandan Priyadarshi >> > > > > -- > Thanks, > Sujay P Bawaskar > M:+91-77091 53669
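A minimal SolrJ sketch along those lines, pulling rows over JDBC and indexing in batches (connection details, table and field names are hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class JdbcToSolr {
      public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/shop", "user", "pass");
             SolrClient solr = new HttpSolrClient.Builder(
                 "http://localhost:8983/solr/products").build()) {
          Statement st = con.createStatement();
          ResultSet rs = st.executeQuery("SELECT id, name, price FROM product");
          List<SolrInputDocument> batch = new ArrayList<>();
          while (rs.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rs.getString("id"));
            doc.addField("name", rs.getString("name"));
            doc.addField("price", rs.getFloat("price"));
            batch.add(doc);
            if (batch.size() == 1000) {   // send in batches, not one doc at a time
              solr.add(batch);
              batch.clear();
            }
          }
          if (!batch.isEmpty()) solr.add(batch);
          solr.commit();
        }
      }
    }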
RE: Re:LTR model upload
Thanks Christine, that makes sense. I am currently working around the issue by limiting how many trees RankLib produces, which reduces the file size. --Brian -Original Message- From: Christine Poerschke (BLOOMBERG/ QUEEN VIC) [mailto:cpoersc...@bloomberg.net] Sent: Friday, December 01, 2017 11:03 AM To: solr-user@lucene.apache.org Subject: Re:LTR model upload Hi Brian, Thank you for this question! It sounds like you may be experiencing the issue reported in https://issues.apache.org/jira/browse/SOLR-11049 and perhaps the same workaround could work for you? And in the upcoming Solr 7.2 release https://issues.apache.org/jira/browse/SOLR-11250 will provide an alternative way for handling large models. Hope that helps. Regards, Christine - Original Message - From: solr-user@lucene.apache.org To: solr-user@lucene.apache.org At: 11/28/17 23:00:11 When I upload, I can see my model when I hit "solr/collection_name/schema/model-store/" but when I reload the collection, it's gone. Is there a size limit for LTR models? I have a 1.6mb / 49,000 line long lambdamart model (that's what ranklib spit out for me) which I didn't think would be a huge problem. I decided to test by cutting down the model size by deleting 99% of it and it worked after reload. Does this mean my model is too big or do I possibly have a syntax bug somewhere in my model.json? * Brian
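For reference, a sketch of limiting the ensemble size on the RankLib side (jar name and file names are illustrative); -tree controls the number of trees and -leaf the leaves per tree for LambdaMART (-ranker 6):

    java -jar RankLib.jar -train train.txt -ranker 6 -metric2t NDCG@10 \
         -tree 250 -leaf 10 -save lambdamart-model.txt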
Re: JVM GC Issue
Hi, Thank you both for your responses. I just have the solr log for the very last period of the GC log. A grep command allows me to count the queries per minute with hits > 1000 or > 1, i.e. those with the biggest impact on memory and CPU during faceting:

hits > 1000:
59 11:13
45 11:14
36 11:15
45 11:16
59 11:17
40 11:18
95 11:19
123 11:20
137 11:21
123 11:22
86 11:23
26 11:24
19 11:25
17 11:26

hits > 1:
55 11:19
78 11:20
48 11:21
134 11:22
93 11:23
10 11:24

So we see that at the time the GC went nuts, the large-result-set count increased. The query field includes a phonetic filter, and results are really not relevant due to this. I will suggest to: 1/ remove the phonetic filter in order to have fewer irrelevant results and so get smaller result sets 2/ enable docValues on the fields used for faceting I expect this to decrease GC requirements and stabilize GC. Regards Dominique On Fri, Dec 1, 2017 at 18:17, Erick Erickson wrote: > Your autowarm counts are rather high, but as Toke says this doesn't > seem outrageous. > > I have seen situations where Solr is running close to the limits of > its heap and GC only reclaims a tiny bit of memory each time, when you > say "full GC with no memory > reclaimed" is that really no memory _at all_? Or "almost no memory"? > This situation can be alleviated by allocating more memory to the JVM. > > Your JVM pressure would certainly be reduced by enabling docValues on > any field you sort, facet or group on. That would require a full > reindex of course. Note that this makes your index on disk bigger, but > reduces JVM pressure by roughly the same amount so it's a win in this > situation. > > Have you attached a memory profiler to the running Solr instance? I'd > be curious where the memory is being allocated. > > Best, > Erick > > On Fri, Dec 1, 2017 at 8:31 AM, Toke Eskildsen wrote: > > Dominique Bejean wrote: > >> We are encountering an issue with GC. > > > >> Randomly, nearly once a day, there are consecutive full GCs with no memory > >> reclaimed. > > > > [... 1.2M docs, Xmx 6GB ...] > > > >> GCeasy suggests increasing the heap size, but I do not agree > > > > It does seem strange, with your apparently modest index & workload. > Nothing you say sounds problematic to me and you have covered the usual > culprits: overlapping searchers, faceting and filterCache. > > > > Is it possible for you to share the solr.log around the two times that > memory usage peaked? 2017-11-30 17:00-19:00 and 2017-12-01 08:00-12:00. > > > > If you cannot share, please check if you have excessive traffic around > that time or if there is a lot of UnInverting going on (triggered by > faceting on non-DocValues String fields). I know your post implies that you > have already done so, so this is more of a sanity check. > > > > > > - Toke Eskildsen > -- Dominique Béjean 06 08 46 12 43
Re: starting SolrCloud nodes
On 12/1/2017 10:13 AM, Steve Pruitt wrote: > Thanks to previous help. I have a ZK ensemble of three nodes running. I > have uploaded the config for my collection and the solr.xml file. > I have Solr installed on three machines. > > I think my next steps are: > > Start up each Solr instance: bin/solr start -c -z > "zoo1:2181,zoo2:2181,zoo3:2181" // I have ZK_HOST set to my ZKs, but the > documentation seems to say I need to provide the list here to prevent the > embedded ZK from getting used. The embedded ZK will only get started if you use the -c option and there is no ZK_HOST variable and no -z option on the commandline. If you use both -z and ZK_HOST, then the info I've seen says the -z option will take priority. I haven't looked at the script closely enough to confirm, but that would be the most logical way to operate, so it's probably correct. If ZK_HOST is defined or you use the -z option, you do not need to include the -c option when starting Solr. SolrCloud mode is assumed when ZK info is available. The only time the -c option is *required* is when you want to start the embedded zookeeper. Having the option won't hurt anything, though. To start a solr *service* in cloud mode, all you need is to add ZK_HOST to /etc/default/<service>.in.sh, where <service> is the service name, which defaults to solr. > From one of the Solr instances create a collection: > bin/solr create -c nameofuploadedconfig -s 3 -rf 2 //for example. Nitpick: The -c option on the create command is the name of the collection. To specify the name of the uploaded config, if it happens to be different from the collection name, use the -n option. You can use the -d option to point at a config on disk, and it will be uploaded to a config in zookeeper named after the -n option or the collection. The collection name and the config name are not required to match. You can use the same config for multiple collections. > The documentation I think implies that all of the Solr instances are > automatically set up with the collection. There is no further action needed at this > point? Solr will make an automatic decision as to which nodes will be used to hold the collection. If you use the Collections API directly rather than the commandline, you can give Solr an explicit list of nodes to use. Without the explicit list, Solr will spread the collection across the cluster as widely as it can. https://lucene.apache.org/solr/guide/6_6/collections-api.html#CollectionsAPI-create The "bin/solr create" command, when used on SolrCloud, just makes an HTTP request to the Collections API, unless you use the -d option, in which case it will upload a config to zookeeper before calling the Collections API. Thanks, Shawn
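For reference, the ZK_HOST setting is one line in the include script. A sketch, using the ensemble from the original message; a /solr chroot suffix on the ZK string is optional but common:

    # /etc/default/solr.in.sh (service install) or bin/solr.in.sh (manual start)
    ZK_HOST="zoo1:2181,zoo2:2181,zoo3:2181"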
Re: Solr score use cases
Thx for the clarification. Best regards On 01.12.2017 at 18:25, "Erick Erickson" wrote: > Sorting certainly ignores scoring, I'm pretty sure it's just not > calculated in that case. > > If your sorting results in multiple documents in the same bin, people > will combine the primary sort with a secondary sort on score, so in > that case the score is definitely calculated, i.e. "&sort=day asc, score > desc" > > Returning the score with documents is usually for development > purposes. Scores are _not_ comparable except within a single query, so > IMO telling users that a doc from one search has a score of X and a > doc from another search has a score of Y is useless-to-misleading > information. A score of 2X is _not_ necessarily "twice as good" (or > even as good) as a score of X in another search. > > FWIW, > Erick > > On Fri, Dec 1, 2017 at 6:31 AM, Faraz Fallahi > wrote: > > Or does the score even get calculated when I sort, or not? > > > > On 01.12.2017 at 4:38 PM, "Faraz Fallahi" < > > faraz.fall...@googlemail.com> wrote: > > > >> Oki, but if I just make a simple query with a "where clause" and sort > by > >> a field, I see no sense in calculating a score, right? > >> > >> On 01.12.2017 at 16:33, "Aman Tandon" wrote: > >> > >>> Hi Faraz, > >>> > >>> The Solr score, which you can retrieve by adding it to the fl parameter, can be > >>> helpful for understanding the following: > >>> > >>> 1) search relevance ranking: how much score solr has given to the top & > >>> second top document, and with debug=true you can better understand > what > >>> is causing that score. > >>> > >>> 2) You can use a function query to multiply the score with some feature, > >>> e.g. paid-customer score, popularity score, etc., to improve the > relevance > >>> as per the business. > >>> > >>> These are the points I can think of; someone else may shed more > light > >>> if I am missing anything. I hope this is what you want to know. 😊 > >>> > >>> Regards, > >>> Aman > >>> > >>> On Dec 1, 2017 13:38, "Faraz Fallahi" > >>> wrote: > >>> > >>> Hi > >>> > >>> A simple question: what are the most common use cases for the solr > score > >>> of > >>> documents retrieved after firing queries? > >>> I don't have a real understanding of its purpose at the moment. > >>> > >>> Thx for helping > >>> > >> >
Re: Reg:- Indexing MySQL data with Solr
On 12/1/2017 11:51 AM, @Nandan@ wrote: > I am working on an e-commerce database. I have more than 40 tables and > around 20GB of data. > I want to index the data with Solr for a more effective search feature. > Please tell me how to index MySQL data with Apache Solr. There are multiple possible ways to do this. One of the most straightforward is to use the dataimport handler. DIH is reasonably efficient, although the one major limitation it has is that it's single threaded. Depending on how your multiple tables relate to each other, it may be a little bit challenging to configure properly. The best results with DIH are obtained when you can write a single SQL query (possibly using the view feature in MySQL) that can retrieve every bit of data you want to index. For best performance, you would want to write a custom program that can retrieve the data from your database and then start multiple threads or multiple processes to do indexing in parallel. You may even want to use multiple threads for the retrieval process, depending on how fast a single request against the database can return data. Thanks, Shawn
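A sketch of the single-query approach using a MySQL view (table and column names are hypothetical); DIH or a custom indexer can then select from the view as if it were one flat table:

    CREATE VIEW solr_products AS
    SELECT p.id, p.name, p.description, p.price, c.name AS category
    FROM products p
    JOIN categories c ON c.id = p.category_id;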
Re: Java 9 and Solr 6.6
On 12/1/2017 12:32 PM, marotosg wrote: > Would you recommend installing Solr 6.6.1 with Java 9 for a production > environement? Solr 7.x has been tested with Java 9 and should work with no problems. I believe that code changes were required to achieve this compatibility, so 6.6 might have issues with Java 9. The release information for 6.6 only mentions Java 8, while the release information for 7.0 explicitly says that it works with Java 9. I would not try running 6.6 with Java 9 in production without first testing every part of the implementation on a dev server ... and based on the limited information I know about, I'm not confident that those tests would pass. Thanks, Shawn
Re: JVM GC Issue
Dominique: Actually, the memory requirements shouldn't really go up as the number of hits increases. The general algorithm is (say rows=10): Calculate the score of each doc. If the score is zero, ignore. If the score is > the minimum in my current top 10, replace the lowest scoring doc in my current top 10 with the new doc (a PriorityQueue last I knew), else discard the doc. When all docs have been scored, assemble the return from the top 10 (or whatever rows is set to). The key here is that most of the Solr index is kept in MMapDirectory/OS space, see Uwe's excellent blog here: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html. In terms of _searching_, very little of the Lucene index structures are kept in memory. That said, faceting plays a bit loose with the rules. If you have docValues set to true, most of the memory structures are in the OS memory space, not the JVM. If you have docValues set to false, then the "uninverted" structure is built in the JVM heap space. Additionally, the JVM requirements are sensitive to the number of unique values in the field being faceted on. For instance, let's say you faceted by a date field with just facet.field=some_date_field. A bucket would have to be allocated to hold the counts for each and every unique date value, i.e. one for each millisecond in your search, which might be something you're seeing. Conceptually this is just an array[uniqueValues] of ints (longs? I'm not sure). This should be relatively easily testable by omitting the facets while measuring. Where the number of rows _does_ make a difference is in the return packet. Say I have rows=10. In that case I create a single return packet with all 10 docs' "fl" fields. If rows = 10,000 then that return packet is obviously 1,000 times as large and must be assembled in memory. I rather doubt the phonetic filter is to blame. But you can test this by just omitting the field containing the phonetic filter in the search query. I've certainly been wrong before. Best, Erick On Fri, Dec 1, 2017 at 2:31 PM, Dominique Bejean wrote: > Hi, > > > Thank you both for your responses. > > > I just have the solr log for the very last period of the GC log. > > > A grep command allows me to count the queries per minute with hits > 1000 or > > 1, i.e. those with the biggest impact on memory and CPU during faceting: > > > hits > 1000: > > 59 11:13 > > 45 11:14 > > 36 11:15 > > 45 11:16 > > 59 11:17 > > 40 11:18 > > 95 11:19 > > 123 11:20 > > 137 11:21 > > 123 11:22 > > 86 11:23 > > 26 11:24 > > 19 11:25 > > 17 11:26 > > > hits > 1: > > 55 11:19 > > 78 11:20 > > 48 11:21 > > 134 11:22 > > 93 11:23 > > 10 11:24 > > > So we see that at the time the GC went nuts, the large-result-set count > increased. > > > The query field includes a phonetic filter, and results are really not relevant > due to this. I will suggest to: > > > 1/ remove the phonetic filter in order to have fewer irrelevant results and > so get smaller result sets > > 2/ enable docValues on the fields used for faceting > > > I expect this to decrease GC requirements and stabilize GC. > > > Regards > > > Dominique > > > > > > On Fri, Dec 1, 2017 at 18:17, Erick Erickson > wrote: > >> Your autowarm counts are rather high, but as Toke says this doesn't >> seem outrageous. >> >> I have seen situations where Solr is running close to the limits of >> its heap and GC only reclaims a tiny bit of memory each time, when you >> say "full GC with no memory >> reclaimed" is that really no memory _at all_? Or "almost no memory"? >> This situation can be alleviated by allocating more memory to the JVM. >> >> Your JVM pressure would certainly be reduced by enabling docValues on >> any field you sort, facet or group on. That would require a full >> reindex of course. Note that this makes your index on disk bigger, but >> reduces JVM pressure by roughly the same amount so it's a win in this >> situation. >> >> Have you attached a memory profiler to the running Solr instance? I'd >> be curious where the memory is being allocated. >> >> Best, >> Erick >> >> On Fri, Dec 1, 2017 at 8:31 AM, Toke Eskildsen wrote: >> > Dominique Bejean wrote: >> >> We are encountering an issue with GC. >> > >> >> Randomly, nearly once a day, there are consecutive full GCs with no memory >> >> reclaimed. >> > >> > [... 1.2M docs, Xmx 6GB ...] >> > >> >> GCeasy suggests increasing the heap size, but I do not agree >> > >> > It does seem strange, with your apparently modest index & workload. >> Nothing you say sounds problematic to me and you have covered the usual >> culprits: overlapping searchers, faceting and filterCache. >> > >> > Is it possible for you to share the solr.log around the two times that >> memory usage peaked? 2017-11-30 17:00-19:00 and 2017-12-01 08:00-12:00. >> > >> > If you cannot share, please check if you have excessive traffic around >> that time or if there is a lot of UnInverting going on (triggered by >> faceting on non-DocValues String fields). I know your post implies that you >> have already done so, so this is more of a sanity check. >> > >> > - Toke Eskildsen
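A toy sketch of the top-N collection described above, assuming rows=10. Lucene's actual collector is more elaborate, but the shape, and the O(rows) memory bound independent of hit count, is the same:

    import java.util.PriorityQueue;
    import java.util.Random;

    public class TopNSketch {
      static class Hit {
        final int doc; final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
      }

      public static void main(String[] args) {
        int rows = 10;
        // min-heap on score: the root is the weakest doc in the current top N
        PriorityQueue<Hit> top =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
        Random r = new Random(42);
        for (int doc = 0; doc < 1_000_000; doc++) {
          float score = r.nextFloat();      // stand-in for scoring each doc
          if (score <= 0f) continue;        // zero-scoring docs are ignored
          if (top.size() < rows) {
            top.add(new Hit(doc, score));
          } else if (score > top.peek().score) {
            top.poll();                     // evict the current weakest
            top.add(new Hit(doc, score));
          }                                 // else: discard the doc
        }
        // memory held is O(rows) no matter how many docs matched
        top.forEach(h -> System.out.println(h.doc + " " + h.score));
      }
    }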
>> This situation can be alleviated by allocating more memory to the JVM >> . >> >> Your JVM pressure would certainly be reduced by enabling docValues on >> any field you sort,facet or group on. That would require a full >> reindex of course. Note that this makes your index on disk bigger, but >> reduces JVM pressure by roughly the same amount so it's a win in this >> situation. >> >> Have you attached a memory profiler to the running Solr instance? I'd >> be curious where the memory is being allocated. >> >> Best, >> Erick >> >> On Fri, Dec 1, 2017 at 8:31 AM, Toke Eskildsen wrote: >> > Dominique Bejean wrote: >> >> We are encountering issue with GC. >> > >> >> Randomly nearly once a day there are consecutive full GC with no memory >> >> reclaimed. >> > >> > [... 1.2M docs, Xmx 6GB ...] >> > >> >> Gceasy suggest to increase heap size, but I do not agree >> > >> > It does seem strange, with your apparently modest index & workload. >> Nothing you say sounds problematic to me and you have covered the usual >> culprits overlapping searchers, faceting and filterCache. >> > >> > Is it possible for you to share the solr.log around the two times that >> memory usage peaked? 2017-11-30 17:00-19:00 and 2017-12-01 08:00-12:00. >> > >> > If you cannot share, please check if you have excessive traffic around >> that time or if there is a lot of UnInverting going on (triggered by >> f