Re: Extracting excerpt from solr
Nevermind, I found a solution. I created an excerpt field in the schema.xml, then I used the copyField method with the maxChars parameter declared to copy the content into it with a limitation of the amount of characters that I wanted. Thanks anyways. -- View this message in context: http://lucene.472066.n3.nabble.com/Extracting-excerpt-from-solr-tp4049067p4049358.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr indexing binary files
Hi, I am new with Solr and I am extracting metadata from binary files through URLs stored in my database. I would like to know what fields are available for indexing from PDFs (the ones that would be initiated as in column=””). For example how would I extract something like file size, format or file type. I would also like to know how to create customized fields in Solr. How those metadata and text content are mapped into Solr schema? Would I have to declare that in the solrconfig.xml or do some more tweaking somewhere else? If someone has a code snippet that could show me it would be greatly appreciated. Thank you in advance. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-binary-files-tp4047470.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr indexing binary files
Hi Jack, thanks a lot for your reply. I did that . However, when I run Solr it gives me a bunch of errors. It actually displays the content of my files on my command line and shows some logs like this: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88) at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:468) at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:350) at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100) at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70) at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:234) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:500) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468) 15-Mar-2013 9:56:29 AM org.apache.solr.handler.dataimport.DocBuilder execute I do have an uniqueKey though. Any ideas what the problem might be? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-binary-files-tp4047470p4047690.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr indexing binary files
Hi Gora, thank you for your reply. I am not using any commands, I just go on the Solr dashboard, db > Dataimport and execute a full-import. *My schema.xml looks like this:* * * *My db-data-config.xml looks like this:* *In my solrconfig.xml I have this:* db-data-config.xml true metadata_ last_modified text size initials name subject company title comments words last_modified_by true Thank you for your help! -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-binary-files-tp4047470p4047702.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr indexing binary files
Sorry, Gora. It is ${fileSourcePaths.urlpath} actually. *My complete schema.xml is this:* id text -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-binary-files-tp4047470p4047778.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr indexing binary files
Hi Gora, Yes, my urlpath points to an url like that. I do not get why uncommenting the catch all dynamic field ("*") does not work for me. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-binary-files-tp4047470p4048542.html Sent from the Solr - User mailing list archive at Nabble.com.
Extracting excerpt from solr
Hi, I am using solr to index data from binary files using BinURLDataSource. I was wondering if anyone knows how to extract an excerpt of the indexed data during search. For example if someone made a search it would return 200 characters as a preview of the whole text content. I read online that hl would do the trick. I tried it even though I am not as interested in highlighting as I am in pulling the excerpt. However, so far I have not been able to make it work. I added a /browse requestHandler to my solrconfig.xml like this: explicit true [HIGHLIGHT] [/HIGHLIGHT] text title true colored 3 70 true 0.5 [-\w ,/\n\"']{20,200} I tried it in other requestHandlers as well without any success. Does anyone have some hints? Thanks In advance. -- View this message in context: http://lucene.472066.n3.nabble.com/Extracting-excerpt-from-solr-tp4049067.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr incorrectly fetching elements that do not match conditions in where as all-null rows
Hello, Solr is trying to process non-existing child/nested entities. By non-existing I mean that they exist in DB but should not be at Solr side because they don't match the conditions in the query I use to fetch them. I have the below solr data configuration. The relationship between tables is complicated, but the point is that I need to fetch child/nested entities and perform some calculations at query time. My problem is that some products have onSite services that are not enabled. I would expect Solr from ignoring those elements because of the conditions in the query. If I turn debug on when importing, I can see that all fields are null. However, Solr still tries to process them, which results in invalid SQL queries because it replaces null fields with nothing. The problem seems to be related to the condition s.enabled=true in the query, because are rows with enabled=false that are causing problems (Solr interprets them as rows with all fields null). I get an invalid SQL query SELECT CONCAT( * (1 - percentage), ',', 'USD') AS fullReducedOnSitePrice FROM discounts WHERE companyId=65. How can I force Solr to ignore, as it should, those elements?
Re: Solr incorrectly fetching elements that do not match conditions in where as all-null rows
Thanks for the promp reply. h.enabled=true is a typo. It should be c.enabled=true, because the table companies also has a column called enabled. That part is working fine (it doesn't fetch companies with enabled=false). About the DB queries, I've taken, by turning Debug and Verbose on in the Dataimport tab, the queries that Solr is sending to DB, executed the same queries in my MySQL client. It clearly says '0 row(s) returned'. 2016-08-15 15:37 GMT+02:00 Alexandre Rafalovitch : > Solr (well DIH) just passes that query to the DB, so if you are > getting extra rows (not extra fields), than I would focus on the > database side of the situation. > > Specifically, I would confirm from the database logs what the sent > query actually looks like. > > Very specifically, in your very first entity, I see the condition > "h.enabled=true" where "h" does not match the table names in the FROM > statement. Perhaps, that's the problem? > > Regards, >Alex. > > Newsletter and resources for Solr beginners and intermediates: > http://www.solr-start.com/ > > > On 15 August 2016 at 23:27, Luis Sepúlveda wrote: > > Hello, > > > > Solr is trying to process non-existing child/nested entities. By > > non-existing I mean that they exist in DB but should not be at Solr side > > because they don't match the conditions in the query I use to fetch them. > > > > I have the below solr data configuration. The relationship between tables > > is complicated, but the point is that I need to fetch child/nested > entities > > and perform some calculations at query time. My problem is that some > > products have onSite services that are not enabled. I would expect Solr > > from ignoring those elements because of the conditions in the query. If I > > turn debug on when importing, I can see that all fields are null. > However, > > Solr still tries to process them, which results in invalid SQL queries > > because it replaces null fields with nothing. > > > > > > > > > query="SELECT s.serviceType, sl.language FROM services s > > LEFT JOIN serviceLanguages sl ON s.id=sl.serviceId WHERE > > companyId=${product.companyId} AND s.enabled=true"> > > > > > > > > > > > query="SELECT s.id, s.enabled, ${product.unitPrice} + > > (hourlyPrice * MIN(hours)) AS onSitePriceRaw, > CONCAT(${product.unitPrice} + > > (hourlyPrice * MIN(hours)), ',', '${product.currency}') AS onSitePrice > FROM > > services s LEFT JOIN serviceHourlyPrices shp ON s.id=shp.serviceId WHERE > > companyId=${product.companyId} AND s.enabled=true AND > s.serviceType='OS'"> > > > > > query="SELECT CONCAT(${onSite.onSitePriceRaw} * (1 - > > percentage), ',', '${product.currency}') AS fullReducedOnSitePrice FROM > > discounts WHERE companyId=${product.companyId} AND category='FULL'"> > > > column="fullReducedOnSitePrice"/> > > > > > query="SELECT CONCAT(${onSite.onSitePriceRaw} * (1 - > > percentage), ',', '${product.currency}') AS partialReducedOnSitePrice > FROM > > discounts WHERE companyId=${product.companyId} AND category='PARTIAL'"> > > > column="partialReducedOnSitePrice"/> > > > > > > > > > > The problem seems to be related to the condition s.enabled=true in the > > query, because are rows with enabled=false that are causing problems > (Solr > > interprets them as rows with all fields null). I get an invalid SQL query > > SELECT CONCAT( * (1 - percentage), ',', 'USD') AS fullReducedOnSitePrice > > FROM discounts WHERE companyId=65. > > > > How can I force Solr to ignore, as it should, those elements? >
Re: Solr incorrectly fetching elements that do not match conditions in where as all-null rows
I'm very sorry, but you're right. Using one of the queries from the query log, I get a 1 row(s) returned. So it itsn't a Solr issue. Thanks a lot Alexandre. 2016-08-15 16:17 GMT+02:00 Alexandre Rafalovitch : > Hmm. I would still take as truth the database logs as opposed to Solr > logs. Or at least network traces using something like Wireshark. > > Otherwise, you need some way to reduce your DIH query to the minimum > reproducible example. I am used to reading tech support emails and > even then I am not sure I can parse the significant configuration > aspects from the multiple parallel and nested entities. Can you reduce > this to the simplest (two level?) entity definition with a single > field and explain what you expected and what you are seeing. > > Regards, >Alex. > P.s. Solr DIH does have a gotcha with SQL import that it automagically > tries to match table column names to fields defined in schema and > populate them even if not explicitly declared. This does not match to > the way you describe the problem (your select statement still needs to > return those fields), but perhaps it interacts with something else to > trigger it. > > Newsletter and resources for Solr beginners and intermediates: > http://www.solr-start.com/ > > > On 15 August 2016 at 23:54, Luis Sepúlveda wrote: > > Thanks for the promp reply. > > > > h.enabled=true is a typo. It should be c.enabled=true, because the table > > companies also has a column called enabled. That part is working fine (it > > doesn't fetch companies with enabled=false). > > > > About the DB queries, I've taken, by turning Debug and Verbose on in the > > Dataimport tab, the queries that Solr is sending to DB, executed the same > > queries in my MySQL client. It clearly says '0 row(s) returned'. > > > > 2016-08-15 15:37 GMT+02:00 Alexandre Rafalovitch : > > > >> Solr (well DIH) just passes that query to the DB, so if you are > >> getting extra rows (not extra fields), than I would focus on the > >> database side of the situation. > >> > >> Specifically, I would confirm from the database logs what the sent > >> query actually looks like. > >> > >> Very specifically, in your very first entity, I see the condition > >> "h.enabled=true" where "h" does not match the table names in the FROM > >> statement. Perhaps, that's the problem? > >> > >> Regards, > >>Alex. > >> > >> Newsletter and resources for Solr beginners and intermediates: > >> http://www.solr-start.com/ > >> > >> > >> On 15 August 2016 at 23:27, Luis Sepúlveda wrote: > >> > Hello, > >> > > >> > Solr is trying to process non-existing child/nested entities. By > >> > non-existing I mean that they exist in DB but should not be at Solr > side > >> > because they don't match the conditions in the query I use to fetch > them. > >> > > >> > I have the below solr data configuration. The relationship between > tables > >> > is complicated, but the point is that I need to fetch child/nested > >> entities > >> > and perform some calculations at query time. My problem is that some > >> > products have onSite services that are not enabled. I would expect > Solr > >> > from ignoring those elements because of the conditions in the query. > If I > >> > turn debug on when importing, I can see that all fields are null. > >> However, > >> > Solr still tries to process them, which results in invalid SQL queries > >> > because it replaces null fields with nothing. 
> >> > > >> > > >> > > >> > >> > query="SELECT s.serviceType, sl.language FROM > services s > >> > LEFT JOIN serviceLanguages sl ON s.id=sl.serviceId WHERE > >> > companyId=${product.companyId} AND s.enabled=true"> > >> > > >> > > >> > > >> > > >> > >> > query="SELECT s.id, s.enabled, ${product.unitPrice} + > >> > (hourlyPrice * MIN(hours)) AS onSitePriceRaw, > >> CONCAT(${product.unitPrice} + > >> > (hourlyPrice * MIN(hours)), ',', '${product.currency}') AS onSitePrice > >> FROM > >> > services s LEFT JOIN serviceHourlyPrices shp ON s.id=shp.serviceId > WHERE > >> > companyId=${product.companyId} AND s.enabled=true AND >
Re: More Robust Search Timeouts (to Kill Zombie Queries)?
Hi Salman, I was interested in something similar, take a look at the following thread: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201401.mbox/%3CCADSoL-i04aYrsOo2%3DGcaFqsQ3mViF%2Bhn24ArDtT%3D7kpALtVHzA%40mail.gmail.com%3E#archives I never followed through, however. -Luis On Mon, Mar 31, 2014 at 6:24 AM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > Anyone? > > > On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > > > With reference to this thread< > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E>I > wanted to know if there was any response to that or if Chris Harris > > himself can comment on what he ended up doing, that would be great! > > > > > > -- > > Regards, > > > > Salman Akram > > > > > > > -- > Regards, > > Salman Akram >
Re: More Robust Search Timeouts (to Kill Zombie Queries)?
I got responses, but no easy solution to allow me to directly cancel a request. The responses did point to: - timeAllowed query parameter that returns partial results - https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-ThetimeAllowedParameter - A possible hack that I never followed through - http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201401.mbox/%3CCANGii8eaSouePGxa7JfvOBhrnJUL++Ct4rQha2pxMefvaWhH=g...@mail.gmail.com%3E Maybe one of those will help you? If they do, make sure to report back! -Luis On Tue, Apr 1, 2014 at 3:13 AM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > So you too never got any response... > > > On Mon, Mar 31, 2014 at 6:57 PM, Luis Lebolo > wrote: > > > Hi Salman, > > > > I was interested in something similar, take a look at the following > thread: > > > > > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201401.mbox/%3CCADSoL-i04aYrsOo2%3DGcaFqsQ3mViF%2Bhn24ArDtT%3D7kpALtVHzA%40mail.gmail.com%3E#archives > > > > I never followed through, however. > > > > -Luis > > > > > > On Mon, Mar 31, 2014 at 6:24 AM, Salman Akram < > > salman.ak...@northbaysolutions.net> wrote: > > > > > Anyone? > > > > > > > > > On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram < > > > salman.ak...@northbaysolutions.net> wrote: > > > > > > > With reference to this thread< > > > > > > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E > > >I > > > wanted to know if there was any response to that or if Chris Harris > > > > himself can comment on what he ended up doing, that would be great! > > > > > > > > > > > > -- > > > > Regards, > > > > > > > > Salman Akram > > > > > > > > > > > > > > > > > -- > > > Regards, > > > > > > Salman Akram > > > > > > > > > -- > Regards, > > Salman Akram >
Re: Replication: slow first query after replication.
Hello, Shawn! I have seen that when disabling replication and executing queries the time responses are good. Interesting... I can't ser the solution, then, because slow replication tomes are needed to almost always get 'fresh' documents in slaves to search by, but this appareantly slows down first queries launched because of caches warm up. There must be a solution for this scenario - I think that it should be very common. Do you think that disabling caches will improve this? Thanks a lot! - Luis Cappa > El 05/11/2013, a las 23:29, Shawn Heisey escribió: > >> On 11/5/2013 10:16 AM, Luis Cappa Banda wrote: >> I have a master-slave replication (Solr 4.1 version) with a 30 seconds >> polling interval and continuously new documents are indexed, so after 30 >> seconds always new data must be replicated. My test index is not huge: just >> 5M documents. >> >> I have experimented that a simple "q=*:*" query appears to be very slow (up >> to 10 secs of QTime). After that first slow query the following "q=*:*" >> queries are much quicker. I feel that warming up caches after replication >> has something to say about this weird behavior, but maybe an index re-built >> is also involved. >> >> Question time: >> >> *1.* How can I warm up caches against? There exists any solrconfig.xml >> searcher to configure to be executed after replication events? >> >> *2. *My system needs to execute queries to the slaves continuously. If >> there exists any warm up way to reload caches, some queries will experience >> slow response times until reload has finished, isn't it? >> >> *3. *After a replication has done, does Solr execute any index rebuild >> operation that slow down query responses, or this poor performance is just >> due to caches? >> >> *4. *My system is always querying by the latest documents indexed (I'm >> filtering by document dates), and I don't use "fq" to execute that queries. >> In this scenario, do you recommend to disable caches? > > I suspect that you may be running into a situation where you don't have > enough OS disk cache for your index. When you replicate, the new data that > has just been replicated pushes existing data out of the cache. You run your > query that is slow, and the *Solr* caches (not the same thing as the OS disk > cache) get populated, making later queries fast. You should be able to > configure autowarming on your Solr caches to help with this, but be aware > that autowarming can be time-consuming, and if you have replications > happening potentially every 30 seconds, you may find that your autowarming is > taking more time than that. This can lead to other problems. > > If the amount of disk space taken up by those 5 million documents is > significantly larger than the amount of memory available on the server that > is not allocated directly to programs like Solr itself, then the only true > solution will be to add memory to the server. > > Thanks, > Shawn >
Re: Facet field query on subset of documents
Hi Erick, Thanks for the reply and sorry, my fault, wasn't clear enough. I was wondering if there was a way to remove terms that would always be zero (because the term came from a document that didn't match the filter query). Here's an example. I have a bunch of documents with fields 'manufacturer' and 'location'. If I set my filter query to "manufacturer = Sony" and all Sony documents had a value of 'Florida' for location, then I want 'Florida' NOT to show up in my facet field results. Instead, it shows up with a count of zero (and it'll always be zero because of my filter query). Using mincount = 1 doesn't solve my problem because I don't want it to hide zeroes that came from documents that actually pass my filter query. Does that make more sense? On Thu, Nov 21, 2013 at 4:36 PM, Erick Erickson wrote: > That's what faceting does. The facets are only tabulated > for documents that satisfy they query, including all of > the filter queries and anh other criteria. > > Otherwise, facet counts would be the same no matter > what the query was. > > Or I'm completely misunderstanding your question... > > Best, > Erick > > > On Thu, Nov 21, 2013 at 4:22 PM, Luis Lebolo > wrote: > > > Hi All, > > > > Is it possible to perform a facet field query on a subset of documents > (the > > subset being defined via a filter query for instance)? > > > > I understand that facet pivoting might work, but it would require that > the > > subset be defined by some field hierarchy, e.g. manufacturer -> price > (then > > only look at the results for the manufacturer I'm interested in). > > > > What if I wanted to define a more complex subset (where the name starts > > with A but ends with Z and some other field is greater than 5 and yet > > another field is not 'x', etc.)? > > > > Ideally I would then define a "facet field constraining query" to include > > only terms from documents that pass this query. > > > > Thanks, > > Luis > > >
Facet field query on subset of documents
Hi All, Is it possible to perform a facet field query on a subset of documents (the subset being defined via a filter query for instance)? I understand that facet pivoting might work, but it would require that the subset be defined by some field hierarchy, e.g. manufacturer -> price (then only look at the results for the manufacturer I'm interested in). What if I wanted to define a more complex subset (where the name starts with A but ends with Z and some other field is greater than 5 and yet another field is not 'x', etc.)? Ideally I would then define a "facet field constraining query" to include only terms from documents that pass this query. Thanks, Luis
Cancel Solr query?
Hi All, Is it possible to cancel a Solr query/request currently in progress? Suppose the user starts searching for something (that takes a long time for Solr to process), then decides the modify the query. I can simply ignore the previous request and create a new request, but Solr is still processing the old request, correct? Is there any way to cancel that first request? Thanks, Luis
Problem querying large StrField?
Hi All, It seems that I can't query on a StrField with a large value (say 70k characters). I have a Solr document with a string type: and field: Note that it's stored, in case that matters. Across my documents, the length of the value in this StrField can be up to ~70k characters or more. The query I'm trying is 'someFieldName_1:*'. If someFieldName_1 has values with length < ~10k characters, then it works fine and I retrieve various documents with values in that field. However, if I query 'someFieldName_2:*' and someFieldName_2 has values with length ~60k, I don't get back any documents. Even though I *know* that many documents have a value in someFieldName_2. If I query *:* and add someFieldName_2 in the field list, I am able to see the (large) value in someFieldName_2. So is there some type of limit to the length of strings in StrField that I can query against? Thanks, Luis
Re: Problem querying large StrField?
Update: It seems I get the bad behavior (no documents returned) when the length of a value in the StrField is greater than or equal to 32,767 (2^15). Is this some type of bit overflow somewhere? On Wed, Feb 5, 2014 at 12:32 PM, Luis Lebolo wrote: > Hi All, > > It seems that I can't query on a StrField with a large value (say 70k > characters). I have a Solr document with a string type: > > > > and field: > > stored="true" /> > > Note that it's stored, in case that matters. > > Across my documents, the length of the value in this StrField can be up to > ~70k characters or more. > > The query I'm trying is 'someFieldName_1:*'. If someFieldName_1 has values > with length < ~10k characters, then it works fine and I retrieve various > documents with values in that field. > > However, if I query 'someFieldName_2:*' and someFieldName_2 has values > with length ~60k, I don't get back any documents. Even though I *know* that > many documents have a value in someFieldName_2. > > If I query *:* and add someFieldName_2 in the field list, I am able to see > the (large) value in someFieldName_2. > > So is there some type of limit to the length of strings in StrField that I > can query against? > > Thanks, > Luis >
Re: Problem querying large StrField?
Hi Yonik, Thanks for the response. Our use case is perhaps a little unusual. The actual domain is in bioinformatics, but I'll try to generalize. We have two types of entities, call them A's and B's. For a given pair of entities (a_i, b_j) we may or may not have an associated data value z. Standard many to many stuff in a DB. Users can select an arbitrary set of entities from A. What we'd then like to ask of Solr is: Which entities of type B have a data value for any of the A's I've selected. The way we've approached this to date is to index the set of B, such that each document has a multivalued field containing the id's of all entities A that have a data value. If I select a set of A (a1, a2, a5, a9), then I would query data availability across B as dataAvailabilityField:(a1 OR a2 OR a5 OR a9). The sets of A and B are fairly large (~10 - 30k). This was working ok, but our datasets have increased and now the giant OR is getting too slow. As an alternative approach, we developed a ValueParser plugin that took advantage of our ability to sort the list of entity id's and do some clever things, like binary searches and short circuits on the results. For this to work, we concatenated all the id's into a single comma delimited value. So the data availability field is now single valued, but has a term that looks like "a1,a3,a6,a7". Our function query then takes the list of A id's that we're interested in and searches the documents for ones that match any value. Worked great and quite fast when the id list was short enough. But then we tried it on the full data set and the indexed terms of id's are HUGE. I know it's a bit of an odd use case, but have you seen anything like this before? Do you have any thoughts on how we might better accomplish this functionality? Thanks! On Wed, Feb 5, 2014 at 1:42 PM, Yonik Seeley wrote: > On Wed, Feb 5, 2014 at 1:04 PM, Luis Lebolo wrote: > > Update: It seems I get the bad behavior (no documents returned) when the > > length of a value in the StrField is greater than or equal to 32,767 > > (2^15). Is this some type of bit overflow somewhere? > > I believe that's the maximum size of an indexed token. > Can you share your use-case? Why are you trying to index such large > values as a single token? > > -Yonik > http://heliosearch.org - native off-heap filters and fieldcache for solr >
Re: SOLR online reference document - WIKI
This page never came up on any of my Google searches, so thanks for the heads up! Looks good. -Luis On Tue, Jun 25, 2013 at 12:32 PM, Learner wrote: > I just came across a wonderful online reference wiki for SOLR and thought > of > sharing it with the community.. > > > https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/SOLR-online-reference-document-WIKI-tp4073110.html > Sent from the Solr - User mailing list archive at Nabble.com. >
CachedSqlEntityProcessor not adding fields
Hi All, I'm trying to use CachedSqlEntityProcessor in one of my sub-entities, but the field never gets populated. I'm using Solr 4.4. The field is a multi-valued field: The relevant part of my data-config.xml looks like: ... Let me know if you need more info. Any ideas appreciated! Thanks, Luis
Re: CachedSqlEntityProcessor not adding fields
I'm noticing some very odd behavior using dataimport from the Admin UI. Whenever I limit the number of rows to 75 or below, the aliases field never gets populated. As soon as I increase the limit to 76 or more, the aliases field gets populated! What am I not understanding here? On Tue, Jul 30, 2013 at 11:04 AM, Luis Lebolo wrote: > Hi All, > > I'm trying to use CachedSqlEntityProcessor in one of my sub-entities, but > the field never gets populated. I'm using Solr 4.4. The field is a > multi-valued field: > > The relevant part of my data-config.xml looks like: > > > > > > > > > > > > > > > cacheKey="ALIAS_MODEL_ID" cacheLookup="model.MODEL_ID"> > > > > ... > > > > > Let me know if you need more info. Any ideas appreciated! > > Thanks, > Luis >
DataImportHandler rows parameter and performance
Hi All, I'm using the Admin UI dataimport page to load some documents into my index. There's a rows parameter that you can leave blank (to load all documents). When I change it to the maximum number of documents, the performance drops by a factor of 10. For example, I have 1627 root entities. If I fill in row with 1627, indexing occurs at about 10 docs per second. If I leave it blank, it occurs at about 1 doc per second. Thanks, Luis
Query on all dynamic fields or wildcard field query
Hi All, First I have to apologize and admit that I'm asking this question before doing any real research =( Was hoping for some preliminary help before I start this endeavor tomorrow. So here goes: Can I query for a value in multiple (wildcarded) fields? For example, if I have dynamic fields fieldName_someToken (e.g. fieldName_1, fieldName_2, fieldName_3), can I construct a query like fieldName_*:someValue? The query itself doesn't work, but is there a way to query numerous dynamic fields without explicitly listing them? Thanks, Luis
SolrException parsing error
Hi All, I'm using Solr 4.1 and am receiving an org.apache.solr.common.SolrException "parsing error" with root cause java.io.EOFException (see below for stack trace). The query I'm performing is long/complex and I wonder if its size is causing the issue? I am querying via POST through SolrJ. The query (fq) itself is ~20,000 characters long in the form of: fq=(mutation_prot_mt_1_1:2374 + OR + mutation_prot_mt_2_1:2374 + OR + mutation_prot_mt_3_1:2374 + ...) + OR + (mutation_prot_mt_1_2:2374 + OR + mutation_prot_mt_2_2:2374 + OR + mutation_prot_mt_3_2:2374+...) + OR + ... In short, I am querying for an ID throughout multiple dynamically created fields (mutation_prot_mt_#_#). Any thoughts on how to further debug? Thanks in advance, Luis -- SEVERE: Servlet.service() for servlet [X] in context with path [/x] threw exception [Request processing failed; nested exception is org.apache.solr.common.SolrException: parsing error] with root cause java.io.EOFException at org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:193) at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:107) at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:387) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90) at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301) at x.x.x.x.x.x.someMethod(x.java:111) at x.x.x.x.x.x.otherMethod(x.java:222) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:213) at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:126) at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:96) at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:617) at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:578) at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:80) at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:923) at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852) at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882) at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:778) at javax.servlet.http.HttpServlet.service(HttpServlet.java:621) at javax.servlet.http.HttpServlet.service(HttpServlet.java:722) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:330) at x.x.x.x.x.yetAnotherMethod(x.java:333) at 
org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:118) at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:84) at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:113) at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) at org.springframework.security.web.authentication.AnonymousAuthenticationFilter.doFilter(AnonymousAuthenticationFilter.java:113) at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) at org.springframework.security.web.authentication.rememberme.RememberMeAuthenticationFilter.doFilter(RememberMeAuthenticationFilter.java:146) at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) at org.springframework.security.web.servletapi.SecurityContextHolderAwareRequestFilter.doFilter(SecurityContextHolderAwareRequestFilter.java:54) at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterCha
Re: Query Parser OR AND and NOT
What if you try city:(*:* -H*) OR zip:30* Sometimes Solr requires a list of documents to subtract from (think of "*:* -someQuery" converts to "all documents without someQuery"). You can also try looking at your query with debugQuery = true. -Luis On Mon, Apr 15, 2013 at 12:25 PM, Peter Schütt wrote: > Hallo, > > > Roman Chyla wrote in > news:caen8dywjrl+e3b0hpc9ntlmjtrkasrqlvkzhkqxopmlhhfn...@mail.gmail.com: > > > should be: -city:H* OR zip:30* > > > -city:H* OR zip:30* numFound:2520 > > gives the same wrong result. > > > Another Idea? > > Ciao > Peter Schütt > > >
Re: SolrException parsing error [Solved]
Sorry, spoke to soon. Turns out I was not sending the query via POST. Changing the method to POST solved the issue. Apologies for the spam! -Luis On Mon, Apr 15, 2013 at 11:47 AM, Luis Lebolo wrote: > Hi All, > > I'm using Solr 4.1 and am receiving an > org.apache.solr.common.SolrException "parsing error" with root cause > java.io.EOFException (see below for stack trace). The query I'm performing > is long/complex and I wonder if its size is causing the issue? > > I am querying via POST through SolrJ. The query (fq) itself is ~20,000 > characters long in the form of: > > fq=(mutation_prot_mt_1_1:2374 + OR + mutation_prot_mt_2_1:2374 + OR + > mutation_prot_mt_3_1:2374 + ...) + OR + (mutation_prot_mt_1_2:2374 + OR + > mutation_prot_mt_2_2:2374 + OR + mutation_prot_mt_3_2:2374+...) + OR + ... > > In short, I am querying for an ID throughout multiple dynamically created > fields (mutation_prot_mt_#_#). > > Any thoughts on how to further debug? > > Thanks in advance, > Luis > > -- > > SEVERE: Servlet.service() for servlet [X] in context with path [/x] threw > exception [Request processing failed; nested exception is > org.apache.solr.common.SolrException: parsing error] with root cause > java.io.EOFException > at > org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:193) > at > org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:107) > at > org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41) > at > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:387) > at > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) > at > org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90) > at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301) > at x.x.x.x.x.x.someMethod(x.java:111) > at x.x.x.x.x.x.otherMethod(x.java:222) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at > org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:213) > at > org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:126) > at > org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:96) > at > org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:617) > at > org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:578) > at > org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:80) > at > org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:923) > at > org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852) > at > org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882) > at > org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:778) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:621) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:722) > at > 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305) > at > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) > at > org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:330) > at x.x.x.x.x.yetAnotherMethod(x.java:333) > at > org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) > at > org.springframework.security.web.access.intercept.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:118) > at > org.springframework.security.web.access.intercept.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:84) > at > org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) > at > org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:113) > at > org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:342) > at > org.springframework.security.web.authentication.AnonymousAuthenticationFilter.doFilter(Anon
Re: SolrException parsing error
Turns out I spoke too soon. I was *not* sending the query via POST. Changing the method to POST solved the issue for me (maybe I was hitting a GET limit somewhere?). -Luis On Tue, Apr 16, 2013 at 7:38 AM, Marc des Garets wrote: > Did you find anything? I have the same problem but it's on update requests > only. > > The error comes from the solrj client indeed. It is solrj logging this > error. There is nothing in solr itself and it does the update correctly. > It's fairly small simple documents being updated. > > > On 04/15/2013 07:49 PM, Shawn Heisey wrote: > >> On 4/15/2013 9:47 AM, Luis Lebolo wrote: >> >>> Hi All, >>> >>> I'm using Solr 4.1 and am receiving an org.apache.solr.common.** >>> SolrException >>> "parsing error" with root cause java.io.EOFException (see below for stack >>> trace). The query I'm performing is long/complex and I wonder if its size >>> is causing the issue? >>> >>> I am querying via POST through SolrJ. The query (fq) itself is ~20,000 >>> characters long in the form of: >>> >>> fq=(mutation_prot_mt_1_1:2374 + OR + mutation_prot_mt_2_1:2374 + OR + >>> mutation_prot_mt_3_1:2374 + ...) + OR + (mutation_prot_mt_1_2:2374 + OR + >>> mutation_prot_mt_2_2:2374 + OR + mutation_prot_mt_3_2:2374+...) + OR + >>> ... >>> >>> In short, I am querying for an ID throughout multiple dynamically created >>> fields (mutation_prot_mt_#_#). >>> >>> Any thoughts on how to further debug? >>> >>> Thanks in advance, >>> Luis >>> >>> --** >>> >>> SEVERE: Servlet.service() for servlet [X] in context with path [/x] threw >>> exception [Request processing failed; nested exception is >>> org.apache.solr.common.**SolrException: parsing error] with root cause >>> java.io.EOFException >>> at >>> org.apache.solr.common.util.**FastInputStream.readByte(**FastInputStream.java:193) >>> >>> at org.apache.solr.common.util.**JavaBinCodec.unmarshal(** >>> JavaBinCodec.java:107) >>> at >>> org.apache.solr.client.solrj.**impl.BinaryResponseParser.** >>> processResponse(**BinaryResponseParser.java:41) >>> at >>> org.apache.solr.client.solrj.**impl.HttpSolrServer.request(**HttpSolrServer.java:387) >>> >>> at >>> org.apache.solr.client.solrj.**impl.HttpSolrServer.request(**HttpSolrServer.java:181) >>> >>> at >>> org.apache.solr.client.solrj.**request.QueryRequest.process(**QueryRequest.java:90) >>> >>> at org.apache.solr.client.solrj.**SolrServer.query(SolrServer.** >>> java:301) >>> >> >> I am guessing that this log is coming from your SolrJ client, but That is >> not completely clear, so is it SolrJ or Solr that is logging this error? >> If it's SolrJ, do you see anything in the Solr log, and vice versa? >> >> This looks to me like a network problem, where something is dropping the >> connection before transfer is complete. It could be an unusual server-side >> config, OS problems, timeout settings in the SolrJ code, NIC >> drivers/firmware, bad cables, bad network hardware, etc. >> >> Thanks, >> Shawn >> >> >
SolrJ Custom RowMapper
Hi All, Does SolrJ have an option for a custom RowMapper or BeanPropertyRowMapper (I'm using Spring/JDBC terms). I know the QueryResponse has a getBeans method, but I would like to create my own mapping and plug it in. Any pointers? Thanks, Luis
SolrDocument getFieldNames() exclude dynamic fields?
Hi All, I'm using SolrJ's QueryResponse to retrieve all SolrDocuments from a query. When I use SolrDocument's getFieldNames(), I get back a list of fields that excludes dynamic fields (even though I know they are not empty). Is there a way to get a list of all fields for a given SolrDocument? Thanks, Luis
Re: SolrDocument getFieldNames() exclude dynamic fields?
Apologies, I wasn't storing these dynamic fields. On Fri, Apr 26, 2013 at 11:01 AM, Luis Lebolo wrote: > Hi All, > > I'm using SolrJ's QueryResponse to retrieve all SolrDocuments from a > query. When I use SolrDocument's getFieldNames(), I get back a list of > fields that excludes dynamic fields (even though I know they are not empty). > > Is there a way to get a list of all fields for a given SolrDocument? > > Thanks, > Luis >
Re: Add copyTo Field without re-indexing?
Hello. You can also develop an application by yourself that uses Solrj to retrieve all the documents from your índex, process and add all the new information (fields) desired and the index them into another Solr index. Its easy. Goodbye! El 16/09/2011, a las 17:39, "Olson, Ron" escribió: > Hi all- > > I have an 11 gig index that I realize I need to add another field to, but not > from the actual query using DIH, but via copyTo. > > Is there any way to re-parse an existing index, adding the new copyTo field, > without having to basically start all over again with DIH? > > Thanks, > > Ron > > DISCLAIMER: This electronic message, including any attachments, files or > documents, is intended only for the addressee and may contain CONFIDENTIAL, > PROPRIETARY or LEGALLY PRIVILEGED information. If you are not the intended > recipient, you are hereby notified that any use, disclosure, copying or > distribution of this message or any of the information included in or with it > is unauthorized and strictly prohibited. If you have received this message > in error, please notify the sender immediately by reply e-mail and > permanently delete and destroy this message and its attachments, along with > any copies thereof. This message does not create any contractual obligation > on behalf of the sender or Law Bulletin Publishing Company. > Thank you.
Distributed search has problems with some field names
Hello all, I'm experimenting with the "Distributed Search" bits in the nightly builds and I'm facing a problem. I have on my schema.xml some dynamic fields defined like this: multiValued="true" /> When hitting a single shard the following query works fine: http:///select?q=*:*&fl=ts,$distinct_boxes But when I add the "&distrib=true" parameter I get a NullPointerException: java.lang.NullPointerException at org.apache.solr.handler.component.QueryComponent.returnFields(QueryComponent.java:1025) at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:725) at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:700) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:292) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1451) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) The "$" in "$distinct_boxes" appears to be the culprit somehow, the query: /select?q=*:*&fl=ts,distinct_boxes&distrib=true> works without errors, but of course doesn't retrieve the field I want. Funnily enough when requesting the uniqueKey field there are no errors: /select?q=*:*&fl=tid,ts,$distinct_boxes&distrib=true> But somehow the data from the field "$distinct_boxes" doesn't appear in the output. Is there some workaround? Using "fl=*" returns all the data from the fields that start with "$" but it severely increases the size of the response. -- Luis Neves
Re: Distributed search has problems with some field names
Hi, On 09/29/2011 03:10 PM, Erick Erickson wrote: I know I've seen other anomalies with odd characters in field names. In general, it's much safer to use only letters, numbers, and underscores. In fact, I even prefer lowercase letters. Since you're pretty sure those work, why not just use them? Yes, that's what I ended up doing, but it involved a reindex. I was trying to avoid that. Thanks! -- Luis Neves
r1201855 broke stats.facet on long fields
Hello, I've a "long" field defined in my schema: omitNorms="true" positionIncrementGap="0" /> Before r1201855 I could use "stats.facet=ts" which allowed me to have a timeseries of sorts, now I get an error: "Stats can only facet on single-valued fields, not: ts[long{class=org.apache.solr.schema.TrieLongField,analyzer=org.apache.solr.analysis.TokenizerChain,args={precisionStep=0, positionIncrementGap=0, omitNorms=true}}]" Is there any hope of having the old behavior back? Looking at the changed code I see this: if (facetFieldType.isTokenized() || facetFieldType.isMultiValued()) { throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Stats can only facet on single-valued fields, not: " + facetField + "[" + facetFieldType + "]"); } this seem to also "fix" SOLR-1782. -- Luis Neves
Re: r1201855 broke stats.facet on long fields
On 12/08/2011 11:16 PM, Chris Hostetter wrote: ...so if you don't have a version param, or your version param is "1.0" then that would explain this error I have the version param set to "1.4". (If that doens't fix the problem for you. It doesn't. > then i'm genuinely baffled, and please file a Jira bug with as much details as possible about your setup (ideally a fully usable solrconfig.xml+schema.xml that demonstrates your problem) because the StatsComponentTest most certainly already tests that stats can be computed on a multiValued="false" TrieLongField with precisionsStep="0") I will try to set up a reproducible test case. Thanks! -- Luis Neves.
Querying for ~2000 integers - better model?
Hello! First time poster so {insert ignorance disclaimer here ;)}. I'm building a web application backed by an Oracle database and we're using Lucene Solr to index various lists of "entities" (via DIH). We then harness Solr's faceting to allow the user to filter through their searches. One aspect we're having trouble modeling is the concept of data availability. A dataset will have a data value for various entity pairs. To generalize, say we have two entities: Apples and Oranges. Therefore, there's a data value for various Apple and Orange pairs (e.g. apple1 & orange5 have value 6.566). The question we want to model is "which Apples have data for a specific set of Oranges." The problem is that the list of Oranges can be ~2000. Our first (and albeit ugly) approach was to create a dataAvailability field in each Apple document. It's a multi-valued field that holds a list of Oranges (actually a list of Orange IDs) that have data for that specific Apple. Our facet query then becomes ...facet.query=dataAvailability:(1 OR 2 OR 4 OR 45 OR 200 OR ...)... For > 1000 Oranges, the query takes a long time to run the first time a user performs it (afterwards it gets cached so it runs fairly quickly). Any thoughts on how to speed this up? Is there a better model to use? One idea was to use the autowarming features. However, the list of Oranges will always be dynamically built by the user (and it's not feasible to autowarm all possible permutations of ~2000 Oranges =)). Hope the generalization isn't too stupid, and thanks in advance! Cheers, Luis
Re: Querying for ~2000 integers - better model?
Hi Mikhail, Thanks for the interest! The user selects various Oranges from the website. The list of Orange IDs then gets placed into a table in our database. For example, the user may want to search oranges from Florida (a state filter) planted a week ago (a data filter). We then display 600 Oranges that fit this query and the user says "select them all". We then store all 600 IDs in our database. For the data availability filter, we get the list of Orange IDs from the database first then use SolrJ to create the facet query. -Luis On Tue, Feb 5, 2013 at 12:03 PM, Mikhail Khludnev < mkhlud...@griddynamics.com> wrote: > Hello Luis, > > Your problem seems fairly obvious (hard to solve problem). > Where these set of orange id come from? Does an user enter thousand of > these ids into web-form? > > > On Tue, Feb 5, 2013 at 8:49 PM, Luis Lebolo wrote: > > > Hello! First time poster so {insert ignorance disclaimer here ;)}. > > > > I'm building a web application backed by an Oracle database and we're > using > > Lucene Solr to index various lists of "entities" (via DIH). We then > harness > > Solr's faceting to allow the user to filter through their searches. > > > > One aspect we're having trouble modeling is the concept of data > > availability. A dataset will have a data value for various entity pairs. > To > > generalize, say we have two entities: Apples and Oranges. Therefore, > > there's a data value for various Apple and Orange pairs (e.g. apple1 & > > orange5 have value 6.566). > > > > The question we want to model is "which Apples have data for a specific > set > > of Oranges." The problem is that the list of Oranges can be ~2000. > > > > Our first (and albeit ugly) approach was to create a dataAvailability > field > > in each Apple document. It's a multi-valued field that holds a list of > > Oranges (actually a list of Orange IDs) that have data for that specific > > Apple. > > > > Our facet query then becomes ...facet.query=dataAvailability:(1 OR 2 OR 4 > > OR 45 OR 200 OR ...)... > > > > For > 1000 Oranges, the query takes a long time to run the first time a > > user performs it (afterwards it gets cached so it runs fairly quickly). > Any > > thoughts on how to speed this up? Is there a better model to use? > > > > One idea was to use the autowarming features. However, the list of > Oranges > > will always be dynamically built by the user (and it's not feasible to > > autowarm all possible permutations of ~2000 Oranges =)). > > > > Hope the generalization isn't too stupid, and thanks in advance! > > > > Cheers, > > Luis > > > > > > -- > Sincerely yours > Mikhail Khludnev > Principal Engineer, > Grid Dynamics > > <http://www.griddynamics.com> > >
FunctionQuery does not work as advertised
Hello all. I have the need to include the result of a computed value in the search results of solr query and sort by that value. The documentation about FunctionQuery available at: <http://wiki.apache.org/solr/FunctionQuery> states that this is possible (see the "General Example" at the bottom), but I'm unable to make it work. Using solr1.3 and the included example application this is what I get: Query: http://localhost:8983/solr/select?q=id:SP2514N&fl=id,popularity,score> numFound=1 score=3.5649493 But for the Query: <http://localhost:8983/solr/select?q=id:SP2514N _val_:"pow(popularity,2)"&fl=id,popularity,score>, the expected results were: numFound=1 score=36 and what I get instead is: numFound=26 score=13.155498 (for doc with id=SP2514N) This is surprising for two reasons: -The score value is not the square of the "popularity" field. -The result set cardinality is altered by the use of the FunctionQuery and I was under the impression that functions changed the ordering of the results but had no effect on the actual number of matched documents. Is this a bug in solr or the is documentation at fault? Am I missing something? Is there any way to include a computed value in the search results and sort by it? Thanks in advance. -- /** * Luis Neves * @e-mail: luis.ne...@co.sapo.pt * @xmpp: lfs_ne...@sapo.pt * @web: <http://technotes.blogs.sapo.pt/> * @tlm: +351 962 057 656 */
OOM when autowarming is enabled
Hello all. We are having some issues with one of our Solr instances when autowarming is enabled. The index has about 2.2M documents and 2GB of size, so it's not particularly big. Solr runs with "-Xmx1024M -Xms1024M". We are constantly inserting and updating the index, about 20 new/updated documents per minute, with a commit every 10 minutes. These are our cache settings: autowarmCount="256"/> autowarmCount="256"/> autowarmCount="0"/> When the autowarming is disabled there are no OOM errors, but the first search after a commit takes ~10 seconds and that is too long. I've enabled the "-XX:+HeapDumpOnOutOfMemoryError" flag. If this happen again I will be able to produce a headdump for analysis... meanwhile is there any setting that we can tweak that is easier on the memory and still manages to make the first search after a commit return in a reasonable time? Thanks! -- Luis Neves StackTrace: Error during auto-warming of key:[EMAIL PROTECTED]:java.lang.OutOfMemoryError: GC overhead limit exceeded at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:104) at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:159) at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:165) at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:153) at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54) at org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:429) at org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:380) at org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:383) at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72) at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:350) at org.apache.solr.search.function.ReverseOrdFieldSource.getValues(ReverseOrdFieldSource.java:56) at org.apache.solr.search.function.ReciprocalFloatFunction.getValues(ReciprocalFloatFunction.java:57) at org.apache.solr.search.function.LinearFloatFunction.getValues(LinearFloatFunction.java:49) at org.apache.solr.search.function.FunctionQuery$AllScorer.(FunctionQuery.java:100) at org.apache.solr.search.function.FunctionQuery$FunctionWeight.scorer(FunctionQuery.java:78) at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:233) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:143) at org.apache.lucene.search.Searcher.search(Searcher.java:118) at org.apache.lucene.search.Searcher.search(Searcher.java:97) at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:888) at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:805) at org.apache.solr.search.SolrIndexSearcher.access$1(SolrIndexSearcher.java:709) at org.apache.solr.search.SolrIndexSearcher$2.regenerateItem(SolrIndexSearcher.java:251) at org.apache.solr.search.LRUCache.warm(LRUCache.java:193) at org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1385) at org.apache.solr.core.SolrCore$1.call(SolrCore.java:488) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907) at java.lang.Thread.run(Thread.java:619)
Re: OOM when autowarming is enabled
Yonik Seeley wrote: On 7/25/07, Luis Neves <[EMAIL PROTECTED]> wrote: We are having some issues with one of our Solr instances when autowarming is enabled. The index has about 2.2M documents and 2GB of size, so it's not particularly big. Solr runs with "-Xmx1024M -Xms1024M". "Big" is relative to what you are trying to do (faceting, sorting, etc). Good point. We don't use faceting or sorting in this particular index. From the stack trace it looks like a function query is the last straw... it causes a FieldCache entry to be populated, just like sorting would. Depending on the number of unique terms in the field, and the number of fields you sort on or do function queries on, it can take quite a bit of memory. I see ... we use the DismaxQueryHandler and the bf parameter is set like: linear(recip(rord(EntryDate),1,1000,1000),11,0) The objective is to boost the documents by "freshness" ... this is probably the cause of the memory abuse since all the "EntryDate" values are unique. I will try to use something like: EntryDate:[* TO NOW/DAY-3MONTH]^1.5 Thanks!! -- Luis Neves
Re: OOM when autowarming is enabled
Luis Neves wrote: The objective is to boost the documents by "freshness" ... this is probably the cause of the memory abuse since all the "EntryDate" values are unique. I will try to use something like: EntryDate:[* TO NOW/DAY-3MONTH]^1.5 This turn out to be a bad idea ... for some reason using the BoostQuery instead of the BoostFunction slows the search to a crawl. -- Luis Neves
Re: OOM when autowarming is enabled
Yonik Seeley wrote: On 7/25/07, Luis Neves <[EMAIL PROTECTED]> wrote: This turn out to be a bad idea ... for some reason using the BoostQuery instead of the BoostFunction slows the search to a crawl. Dismax throws bq in with the main query, so it can't really be cached separately, so iterating over the number of terms in [* TO NOW/DAY-3MONTH] for each query is expensive. Ok. You could try lowering the resolution of EntryDate to lower the number of unique terms (but that would require reindexing). That would speed up a range query, or lower the memory usage of the FieldCache entry. Solr could also somehow be smarter about the FieldCache and only cache the ordinal and not the actual values (this could apply to sorting too). Lucene's FieldCache doesn't currently support that though, so it would require some hacking. If you didn't want date math, date faceting, or date ranges, you could simply store a date as a classic integer (number of seconds since epoch). function queries would still work on this, and the FieldCache would be 4 bytes per doc. I will do a combination of both, I will add a new int field to the index and use it to hold the number of weeks since epoch (week resolution is good enough for freshness in our case). Thanks! -- Luis Neves
help with dismax query handler syntax
Hello all, Using the standard query handler I can search for a term excluding a category and sort descending by price, e.g.: http://localhost/solr/select/?q=book+-Category:Adults;Price+desc&start=0&rows=10&fl=*,score I'm scratching my head on how to do the same with the Dismax query handler, can anyone point me in the right direction. Thanks! -- Luis Neves
Re: help with dismax query handler syntax
Nevermind, I got it ... Somehow I missed the javadoc. -- Luis Neves Luis Neves wrote: Hello all, Using the standard query handler I can search for a term excluding a category and sort descending by price, e.g.: http://localhost/solr/select/?q=book+-Category:Adults;Price+desc&start=0&rows=10&fl=*,score I'm scratching my head on how to do the same with the Dismax query handler, can anyone point me in the right direction. Thanks! -- Luis Neves
Varying the score acording to search word
Hello all. We have a product catalog that is searchable via Solr, by default we want to exclude results from the "Adult" category unless the search terms match a predetermined list of words. Example: Client searches for "doll", "doll" is not on the list -> we *don't want* to show him Adult results. Client searched for "aphrodisiac", "aphrodisiac" is on the list -> we *want* to show him Adult results. Did I made sense? FunctionQuery seems to be what I want, but it's not clear to me how to use it in this particular case. Can anyone point me in the right direction? Thanks! Luis Neves
Re: Varying the score acording to search word
Mental note: think before post ... this is a simple job for a Servlet filter. sorry for the noise. -- Luis Neves Luis Neves wrote: Hello all. We have a product catalog that is searchable via Solr, by default we want to exclude results from the "Adult" category unless the search terms match a predetermined list of words. Example: Client searches for "doll", "doll" is not on the list -> we *don't want* to show him Adult results. Client searched for "aphrodisiac", "aphrodisiac" is on the list -> we *want* to show him Adult results. Did I made sense? FunctionQuery seems to be what I want, but it's not clear to me how to use it in this particular case. Can anyone point me in the right direction? Thanks! Luis Neves
Re: result grouping?
Yonik Seeley wrote: On 1/3/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: thanks. Yes, the presentation layer could group results, but that is not practical if i want to show the first 20 results out of 200,000 matches. Nutch groups the results by site. Any idea how they do it? Good question. Off the top of my head, one could use a priority queue that can change it's size dynamically. One could increment a group count for each hit (like faceted search with the FieldCache) and if the group count exceeds "n", then you increment the size of the priority queue to allow an additional item to be collected to compensate. -Yonik You might as wheel say that I have to change the dilithium crystals in the flux capacitor :-) One of the reasons I like Solr so much is because I get impressive results without having to know Lucene, which is something that will have to change because I also need this feature. Not knowing much about the internal of Solr/Lucene I had a look at the Facet code in search of ideas, but from what I could see the facet counts are calculated after the Documents are added to the response, it seems to me that any kind of grouping has to be done before that... right? Could you explain in more detail where should I look? Can the TopFieldDocCollector/TopFieldDocs classes be used to this end? I'm immersing my self on Lucene but it will take some time. Side note: Over here, beside Solr, we also use the "FAST" search platform and they call this feature "Field collapsing": <http://www.fastsearch.com/glossary.aspx?m=48&amid=299> I like the syntax they use: "&collapseon=&collapsenum=N" -> Collapse, but keep N number of collapsed documents For some reason they can only collapse on numeric fields (int32). Regards, Luis Neves
Re: is search possible while indexing?
Rafeek Raja wrote: I am beginner to solr and lucene. Is search possible while indexing? Yes... that is just one of the cool features of Solr/Lucene. <http://incubator.apache.org/solr/features.html> -- Luis Neves
Re: result grouping?
Yonik Seeley wrote: There are still some things underspecified though. Let's take an example of collapseon=site, collapsenum=2 The list of un-collapsed matches and their relevancy scores (sort order) is: doc=51, site=A, score=100 doc=52, site=B, score=90 doc=53, site=C, score=80 doc=54, site=B, score=70 doc=55, site=D, score=60 doc=56, site=E, score=50 doc=57, site=B, score=40 doc=58, site=A, score=30 1) If I ask for the top 4 docs, should I get [51,52,53,54] or [51,52,54,53]. Are lower ranking docs moved up in the rankings to be in their higher ranking "group"? The docs move up the ranking. You should get [51,58,52,54] ... or one could make the case that you should get [51,58,52,54,53,55], to get the somewhat equivalent behaviour of a SQL "quota-query", in that case that case the "top 4" would not refer to the number of documents but the number of distinct values for the field you are collapsing. 2) If I ask for the top 3 docs, should I get [51,52,53] because those are the top 3 scoring docs, or should I get [51,58,52] because documents were first groups and then ranked (and 51 and 58 go together)? Another way of asking this is related to (1): should docs outside the "window" be moved up in the rankings to be in their higher ranking "group"? See above. 3) Should the number of documents in a "group" change the relevancy? Should site=B rank higher than site=A? I don't think so... don't know if that is what *should* be done, but that's not what FAST does. 4) Is the collapsing only in the returned results, or just within a page of results. If I ask for docs 4 through 7, should doc 57 be in that list or not? With "FAST" that is an option, the default behaviour is to remove the documents from the resultset and the 57 would not be on the list, but you can choose to not remove them and in that case they are presented last. Defining things to make sense while retaining the ability to page through the results seems to be the challenge. I'm beginning to think that this a little to complex for a first project with Lucene. In my particular case all I want is to group results by category (from a predetermined - and small - category list), I think I will just make a request by category and accept the latency. -- Luis Neves
XML querying
Hello. What I do now to index XML documents it's to use a Filter to strip the markup, this works but it's impossible to know where in the document is the match located. What would it take to make possible to specify a filter query that accepts xpath expressions?... something like: fq=xmlField:/book/content/text() This way only the "/book/content/" element was searched. Did I make sense? Is this possible? -- Luis Neves
Re: XML querying
Hi! Thorsten Scherler wrote: On Mon, 2007-01-15 at 12:23 +, Luis Neves wrote: Hello. What I do now to index XML documents it's to use a Filter to strip the markup, this works but it's impossible to know where in the document is the match located. What would it take to make possible to specify a filter query that accepts xpath expressions?... something like: fq=xmlField:/book/content/text() This way only the "/book/content/" element was searched. Did I make sense? Is this possible? AFAIK short answer: no. The field is ALWAYS plain text. There is no xmlField type. ...but why don't you just add your text in multiple field when indexing. Instead of plain stripping the markup do above xpath on your document and create different fields. Like Makes sense? Yes, but I have documents with different schemas on the same "xml field", also, that way I would have to know the schema of the documents being indexed (which I don't). The schema I use is something like: Where each distinct DocumentType has its own schema. I could revise this approach to use an Solr instance for each DocumentType but I would have to find a way to "merge" results from the different instances because I also need to search across different DocumentTypes... I guess I'm SOL :-( -- Luis Neves
Re: XML querying
Hi, Thorsten Scherler wrote: On Mon, 2007-01-15 at 13:42 +, Luis Neves wrote: I think you should explain your use case a wee bit more. What I do now to index XML documents it's to use a Filter to strip the markup, this works but it's impossible to know where in the document is the match located. why do you need to know where? Poorly phrased from my part. Ideally I want to apply "lucene filters" to the xml content. Something like what Nux does: <http://dsd.lbl.gov/nux/api/nux/xom/pool/FullTextUtil.html> -- Luis Neves
Document "freshness" and Boost Functions
Hello, Reading the javadocs from the DisMaxRequestHandler I see that is possible to use "Boost Functions" to influence the score. How would that work in order to improve the score of recent documents? (I have a timestamp field in the schema)... I'm assuming it's possible (right?), but I can't figure out the syntax. -- Luis Neves
Increment field value
Hello all, We have a Solr/Lucene index for newspaper articles, those articles have associated comments. When searching for articles we want to present the number of comments per article. What we do now is to fetch from the DB the sum of comments for each articleId that Solr returns, but this is bringing the DB to its knees. We would like to store the number of comments in the Solr index to save the DB some work. Is it possible when updating a numeric field to increment the existing value instead of replacing it with a new value? The problem we are having is that we can't retrieve the number of comments increment it and update the index because the "actual" value might be uncommitted... is there any other alternative to this problem? Thanks in advance for any help. -- Luis Neves
Re: Increment field value
I forgot one little detail. The DB server is untouchable. I have "read-only" access to it. The database is a component of an big "enterprisy" CMS. The obvious solution of adding a "#Posts" field to the table updated with a trigger is not viable. We have a ticket open with the vendor, but they are not what we could call agile. -- Luis Neves Luis Neves wrote: Hello all, We have a Solr/Lucene index for newspaper articles, those articles have associated comments. When searching for articles we want to present the number of comments per article. What we do now is to fetch from the DB the sum of comments for each articleId that Solr returns, but this is bringing the DB to its knees. We would like to store the number of comments in the Solr index to save the DB some work. Is it possible when updating a numeric field to increment the existing value instead of replacing it with a new value? The problem we are having is that we can't retrieve the number of comments increment it and update the index because the "actual" value might be uncommitted... is there any other alternative to this problem? Thanks in advance for any help. -- Luis Neves
Parsing cluster result's docs
Hi, I have a Solr instance using the clustering component (with the Lingo algorithm) working perfectly. However when I get back the cluster results only the ID's of these come back with it. What is the easiest way to retrieve full documents instead? Should I parse these IDs into a new query to Solr, or is there some configuration I am missing to return full docs instead of IDs? If it matters, I am using Solr 4.10. Thanks.
Real-Time get and Dynamic Fields: possible bug.
Hi there, I have the following dynamicFields definition in my schema.xml: I' ve seen that when fetching documents with /select?q=id:whateverId, the results returned include both i18n* and *_facet fields filled. However, when using real-time request handler (/get?ids:whateverIds) the result fetched include only i18n* dynamic fields, but *_facet ones are not included. I have the impression during /get RequestHandler the server-side regular expression used when parsing fields and fields values to return documents with existing dynamic fields seems to be wrong. From the client side, I' ve checked that the class DocField.java that parses SolrDocument to Bean ones uses the following matcher: } else if (annotation.value().indexOf('*') >= 0) { // dynamic fields are annotated as @Field("categories_*") // if the field was annotated as a dynamic field, convert the name into a pattern // the wildcard (*) is supposed to be either a prefix or a suffix, hence the use of replaceFirst name = annotation.value().replaceFirst("\\*", "\\.*"); dynamicFieldNamePatternMatcher = Pattern.compile("^" + name + "$"); } else { name = annotation.value(); } So maybe a similar behavior from the server-side is wrong. That' s the only reason I find to understand why when using /select all fields are returned but when using /get those that matches *_facet regexp are not. If you can confirm that this is a bug (because maybe is the expected behavior, but after some years using Solr I think it is not) I can create the JIRA issue and debug it more deeply to apply a patch with the aim to help. Regards, -- - Luis Cappa
Re: Real-Time get and Dynamic Fields: possible bug.
Ehem, *_target ---> *_facet. 2015-05-14 16:47 GMT+02:00 Luis Cappa Banda : > Hi Yonik, > > Yes, they are the target from copyFields in the schema.xml. This *_target > fields are suposed to be used in some specific searchable (thus, tokenized) > fields that in the future are candidates to be faceted to return some > stats. For example, imagine that you have a field storing a directory path > and you want to search by. Also, you may want to facet by the whole > directory path value (not just their terms). Thats why I' m storing both > field values: searchable and tokenized one, string and 'facet candidate' > one. > > What I do not understand is that both i18n* and *_target are dynamic, > indexed and stored values. The only difference is that *_target one is > multivalued. Does it have some sense? > > > Regards > > > - Luis Cappa > > 2015-05-14 16:42 GMT+02:00 Yonik Seeley : > >> Are the _facet fields the target of a copyField in the schema? >> Realtime get either gets the values from the transaction log (and if >> you didn't send it the values, they won't be there) or gets them from >> the index to try and reconstruct what was sent in. >> >> It's generally not recommended to have copyField targets "stored", or >> have a mix of explicitly set values and copyField values in the same >> field. >> >> -Yonik >> >> On Thu, May 14, 2015 at 7:17 AM, Luis Cappa Banda >> wrote: >> > Hi there, >> > >> > I have the following dynamicFields definition in my schema.xml: >> > >> > >> > >> > >> > > /> > indexed= >> > "true" stored="true" multiValued="true" /> >> > >> > >> > I' ve seen that when fetching documents with /select?q=id:whateverId, >> the >> > results returned include both i18n* and *_facet fields filled. However, >> > when using real-time request handler (/get?ids:whateverIds) the result >> > fetched include only i18n* dynamic fields, but *_facet ones are not >> > included. >> > >> > I have the impression during /get RequestHandler the server-side regular >> > expression used when parsing fields and fields values to return >> documents >> > with existing dynamic fields seems to be wrong. From the client side, >> I' ve >> > checked that the class DocField.java that parses SolrDocument to Bean >> ones >> > uses the following matcher: >> > >> > } else if (annotation.value().indexOf('*') >= 0) { // dynamic fields >> are >> > annotated as @Field("categories_*") >> > >> > // if the field was annotated as a dynamic field, convert the name into >> a >> > pattern >> > >> > // the wildcard (*) is supposed to be either a prefix or a suffix, hence >> > the use of replaceFirst >> > >> > name = annotation.value().replaceFirst("\\*", "\\.*"); >> > >> > dynamicFieldNamePatternMatcher = Pattern.compile("^" + name + "$"); >> > >> > } else { >> > >> > name = annotation.value(); >> > >> > } >> > >> > So maybe a similar behavior from the server-side is wrong. That' s the >> only >> > reason I find to understand why when using /select all fields are >> returned >> > but when using /get those that matches *_facet regexp are not. >> > >> > If you can confirm that this is a bug (because maybe is the expected >> > behavior, but after some years using Solr I think it is not) I can >> create >> > the JIRA issue and debug it more deeply to apply a patch with the aim to >> > help. >> > >> > >> > Regards, >> > >> > >> > -- >> > - Luis Cappa >> > > > > -- > - Luis Cappa > -- - Luis Cappa
Re: Real-Time get and Dynamic Fields: possible bug.
Hi Yonik, Yes, they are the target from copyFields in the schema.xml. This *_target fields are suposed to be used in some specific searchable (thus, tokenized) fields that in the future are candidates to be faceted to return some stats. For example, imagine that you have a field storing a directory path and you want to search by. Also, you may want to facet by the whole directory path value (not just their terms). Thats why I' m storing both field values: searchable and tokenized one, string and 'facet candidate' one. What I do not understand is that both i18n* and *_target are dynamic, indexed and stored values. The only difference is that *_target one is multivalued. Does it have some sense? Regards - Luis Cappa 2015-05-14 16:42 GMT+02:00 Yonik Seeley : > Are the _facet fields the target of a copyField in the schema? > Realtime get either gets the values from the transaction log (and if > you didn't send it the values, they won't be there) or gets them from > the index to try and reconstruct what was sent in. > > It's generally not recommended to have copyField targets "stored", or > have a mix of explicitly set values and copyField values in the same > field. > > -Yonik > > On Thu, May 14, 2015 at 7:17 AM, Luis Cappa Banda > wrote: > > Hi there, > > > > I have the following dynamicFields definition in my schema.xml: > > > > > > > > > > > indexed= > > "true" stored="true" multiValued="true" /> > > > > > > I' ve seen that when fetching documents with /select?q=id:whateverId, the > > results returned include both i18n* and *_facet fields filled. However, > > when using real-time request handler (/get?ids:whateverIds) the result > > fetched include only i18n* dynamic fields, but *_facet ones are not > > included. > > > > I have the impression during /get RequestHandler the server-side regular > > expression used when parsing fields and fields values to return documents > > with existing dynamic fields seems to be wrong. From the client side, I' > ve > > checked that the class DocField.java that parses SolrDocument to Bean > ones > > uses the following matcher: > > > > } else if (annotation.value().indexOf('*') >= 0) { // dynamic fields are > > annotated as @Field("categories_*") > > > > // if the field was annotated as a dynamic field, convert the name into a > > pattern > > > > // the wildcard (*) is supposed to be either a prefix or a suffix, hence > > the use of replaceFirst > > > > name = annotation.value().replaceFirst("\\*", "\\.*"); > > > > dynamicFieldNamePatternMatcher = Pattern.compile("^" + name + "$"); > > > > } else { > > > > name = annotation.value(); > > > > } > > > > So maybe a similar behavior from the server-side is wrong. That' s the > only > > reason I find to understand why when using /select all fields are > returned > > but when using /get those that matches *_facet regexp are not. > > > > If you can confirm that this is a bug (because maybe is the expected > > behavior, but after some years using Solr I think it is not) I can create > > the JIRA issue and debug it more deeply to apply a patch with the aim to > > help. > > > > > > Regards, > > > > > > -- > > - Luis Cappa > -- - Luis Cappa
Re: Real-Time get and Dynamic Fields: possible bug.
That is something I didin' t know, but I thought it was mandatory. I' ll try to explain step by step my (I think) logical way to understand it: - If a field is indexed, you can search by it. - When faceting, you have to index the field (because it can be tokenized and then you would like to facet by their terms). Then, you need to mark as indexed those fields you want to facet by. - If you mark as stored a field, you can return its value with the 'original value' it was stored. - If you facet, you are searching, counting terms and returning values and their counters. Thus, that "returning their values" step is what I thought where 'stored=true' was necessary. If you don' t mark as stored a field indexed and 'facetable', I was expecting to not be able to return their values, so faceting has no sense. Thats what I thought, of course. If it is not necessary, thats perfect: the lighter the data, the better, and one more thing I' ve learned, :-) Anyway, I think that the question is still open: both are dynamic fields, stored (it is not necessary, OK) and indexed. When applying real time requestHandler, i18n* dynamic fields are returned but those *_facet are not. However, when applying the default /select requestHandler and finding by the document id, both i18n* and *_facet fields are returned. You can try it with Solr 5.1, the version I' m currently using. The only differences between them are: - Regular expression: i18n* VS *_facet - Multivalued: *_facet are multivalued. Regards, - Luis Cappa 2015-05-14 18:32 GMT+02:00 Yonik Seeley : > On Thu, May 14, 2015 at 10:47 AM, Luis Cappa Banda > wrote: > > Hi Yonik, > > > > Yes, they are the target from copyFields in the schema.xml. This *_target > > fields are suposed to be used in some specific searchable (thus, > tokenized) > > fields that in the future are candidates to be faceted to return some > > stats. For example, imagine that you have a field storing a directory > path > > and you want to search by. Also, you may want to facet by the whole > > directory path value (not just their terms). Thats why I' m storing both > > field values: searchable and tokenized one, string and 'facet candidate' > > one. > > OK, but you don't need to *store* the values in _facet, right? > -Yonik > -- - Luis Cappa
Re: Real-Time get and Dynamic Fields: possible bug.
Yep, but those dynamic fields had a field type "string", so the unique indexed therm will be the entire field value and the faceted terms counted will match with exactly with each field value. Thats why I was confused. Typically I use faceting with string non tokenized field values for simple stats and this kind of things. Do you think the behavior explained (I mean, ghost dynamic field values when using real-time request handler) can be a bug? I don' t mind investigating it this weekend and trying to patch it. 2015-05-14 18:59 GMT+02:00 Yonik Seeley : > On Thu, May 14, 2015 at 12:49 PM, Luis Cappa Banda > wrote: > > If you don' t mark as stored a field indexed and 'facetable', I was > > expecting to not be able to return their values, so faceting has no > sense. > > Faceting does not use or retrieve stored field values. The labels > faceting returns are from the indexed values. > > "If you want the value returned, it needs to be stored" only applies > to fields in the main document list (the fields that are retrieved for > the top ranked documents). > > -Yonik > -- - Luis Cappa
Re: Issue serving concurrent requests to SOLR on PROD
Hi there, Unfortunately I don' t agree with Shawn when he suggest to update server.xml configuration up to 1 in maxThreads. If Tomcat (due to the concurrent overload you' re suffering, the type of the queries you' re handling, etc.) cannot manage the requested queries what could happen is that Tomcat internal request queue fills and and Out of Memory may appear to say hello to you. Solr is multithreaded and Tomcat also it is, but those Tomcat threads are managed by an internal thread pool with a queue. What Tomcat does is to dispatch requests as much it cans over the web applications that are deployed in it (in this case, Solr). If Tomcat receives more requests that it can answer its internal queue starts to be filled. Those timeouts from the client side you explained seems to be due to Tomcat thread pool and its queue is starting to fill up. You can check it monitoring its memory and thread usage and I' m sure you' ll see how it grows correlated with the number of concurrent requests they receive. Then, for sure you' ll se a more or less horizontal line from memory usage and those timeouts will appear from the cliente side. Basically I think that our scenarios are: - Queries are slow. You should check and try to improve them, because maybe they are bad formed and that queries are destroying your performance. Also, check your index configuration (segments number, etc.). - Queries are OK, but you receive more queries that you can handle. Your configuration and everything is well done, but you are trying to consume more requests that you can dispatch and answer. If you cannot improve your queries, or your queries are OK but you receive more requests that the ones you can handle, the only solution you have is to scale horizontally and startup new Tomcat + Solrs from 4 to N nodes. Best, - Luis Cappa 2015-05-19 15:57 GMT+02:00 Michael Della Bitta : > Are you sure the requests are getting queued because the LB is detecting > that Solr won't handle them? > > The reason why I'm asking is I know that ELB doesn't handle bursts well. > The load balancer needs to "warm up," which essentially means it might be > underpowered at the beginning of a burst. It will spool up more resources > if the average load over the last minute is high. But for that minute it > will definitely not be able to handle a burst. > > If you're testing infrastructure using a benchmarking tool that doesn't > slowly ramp up traffic, you're definitely encountering this problem. > > Michael > > Jani, Vrushank > 2015-05-19 at 03:51 > > Hello, > > We have production SOLR deployed on AWS Cloud. We have currently 4 live > SOLR servers running on m3xlarge EC2 server instances behind ELB (Elastic > Load Balancer) on AWS cloud. We run Apache SOLR in Tomcat container which > is sitting behind Apache httpd. Apache httpd is using prefork mpm and the > request flows from ELB to Apache Httpd Server to Tomcat (via AJP). > > Last few days, we are seeing increase in the requests around 2 > requests minute hitting the LB. In effect we see ELB Surge Queue Length > continuously being around 100. > Surge Queue Length: represents the total number of request pending > submission to the instances, queued by the load balancer; > > This is causing latencies and time outs from Client applications. Our > first reaction was that we don't have enough max connections set either in > HTTPD or Tomcat. What we saw, the servers are very lightly loaded with very > low CPU and memory utilisation. 
Apache preform settings are as below on > each servers with keep-alive turned off. > > > StartServers 8 > MinSpareServers 5 > MaxSpareServers 20 > ServerLimit 256 > MaxClients 256 > MaxRequestsPerChild 4000 > > > > Tomcat server.xml has following settings. > > maxThreads="500" connectionTimeout="6"/> > For HTTPD – we see that there are lots of TIME_WAIT connections Apache > port around 7000+ but ESTABLISHED connections are around 20. > For Tomact – we see about 60 ESTABLISHED connections on tomcat AJP port. > > So the servers and connections doesn't look like fully utilised to the > capacity. There is no visible stress anywhere. However we still get > requests being queued up on LB because they can not be served from > underlying servers. > > Can you please help me resolving this issue? Can you see any apparent > problem here? Am I missing any configuration or settings for SOLR? > > Your help will be truly appreciated. > > Regards > VJ > > > > > > > Vrushank Jani [http://media.for.truelocal.com.au/signature/img/divider.png] > Senior Java Developer > T 02 8312 1625[http://media.for.truelocal.com.au/signature/img/di
Solr read-only mode with same datadir: commits are not working.
Hey guys, I've doing some tests sharing the same index between three Solr servers: *SolrA*: is allowed to both read and index. The index is stored in a NFS. It has its own configuration files. *SolrB and SolrC*: they can only read from the shared index and each one has their own configuration files. Solrconfig.xml has been changed with the following parameters: single When all servers startup they all work perfectly executing search operations. The problem appears when SolrA index new documents (commiting itself afther that indexation operation). If I manually execute a commit or a softCommit to SolrB or SolrC, they are not able to see the new documents added even if it is suposed to reopen a new searcher when a commit occurs. I have noticed that a commit operation in SolrA shows different segments (the newest ones) compared with the logs that SorlB/SolrC has after a commit. In other words, SolrA shows newer segments and SolrB/SolrC appears to see just the old ones. Is that normal? Any idea or suggestion to solve this? Thank you in advance, :-) Best regards, -- - Luis Cappa
Re: Solr read-only mode with same datadir: commits are not working.
I've seen that StandardDirectoryReader appears in the commit logs. Maybe this DirectoryReader type is caching somehow the old segments in SolrB and SolrC even if they have been commited previosly. If that's true, does exist any other DirectoyReader type (I don't know, SimpleDirectoryReader or FSDirectoyReader) that always read the current segments when a commit happens? 2014-03-12 11:35 GMT+01:00 Luis Cappa Banda : > Hey guys, > > I've doing some tests sharing the same index between three Solr servers: > > *SolrA*: is allowed to both read and index. The index is stored in a NFS. > It has its own configuration files. > *SolrB and SolrC*: they can only read from the shared index and each one > has their own configuration files. Solrconfig.xml has been changed with the > following parameters: > > single > > > When all servers startup they all work perfectly executing search > operations. The problem appears when SolrA index new documents (commiting > itself afther that indexation operation). If I manually execute a commit or > a softCommit to SolrB or SolrC, they are not able to see the new documents > added even if it is suposed to reopen a new searcher when a commit occurs. > > I have noticed that a commit operation in SolrA shows different segments > (the newest ones) compared with the logs that SorlB/SolrC has after a > commit. In other words, SolrA shows newer segments and SolrB/SolrC appears > to see just the old ones. > > Is that normal? Any idea or suggestion to solve this? > > Thank you in advance, :-) > > Best regards, > > -- > - Luis Cappa > -- - Luis Cappa
Re: Solr read-only mode with same datadir: commits are not working.
Hi again! I'm diving inside DirectUpdateHandler2 code and it seems that the problem is that when a commit, when core.openNewSercher(true,true) is called it returns a RefCounted with a new searcher reference that points to an old (probably cached somehow) data dir. I've tried with core.openNewSearcher(false, false) but it doesn't work. What I think that I need is simple: after a commit, SolrIndexSearcher must be reload with a recent index snapshot not using any NRT caching method or whatever. (...) synchronized (solrCoreState.getUpdateLock()) { if (ulog != null) ulog.preSoftCommit(cmd); if (cmd.openSearcher) { core.getSearcher(true, false, waitSearcher); } else { // force open a new realtime searcher so realtime-get and versioning code can see the latest * RefCounted searchHolder = core.openNewSearcher(true, true); * searchHolder.decref(); } if (ulog != null) ulog.postSoftCommit(cmd); } It seems that executing this a new SolrIndexSearcher is returned, but I don't know how to set that new SolrIndexSearcher to the SolrCore instance: * SolrIndexSearcher searcher = core.newSearcher("Last update searcher");* Does anybody knows if possible? Thanks in advance! Best, 2014-03-12 12:10 GMT+01:00 Luis Cappa Banda : > I've seen that StandardDirectoryReader appears in the commit logs. Maybe > this DirectoryReader type is caching somehow the old segments in SolrB and > SolrC even if they have been commited previosly. If that's true, does exist > any other DirectoyReader type (I don't know, SimpleDirectoryReader or > FSDirectoyReader) that always read the current segments when a commit > happens? > > > 2014-03-12 11:35 GMT+01:00 Luis Cappa Banda : > > Hey guys, >> >> I've doing some tests sharing the same index between three Solr servers: >> >> *SolrA*: is allowed to both read and index. The index is stored in a >> NFS. It has its own configuration files. >> *SolrB and SolrC*: they can only read from the shared index and each one >> has their own configuration files. Solrconfig.xml has been changed with the >> following parameters: >> >> single >> >> >> When all servers startup they all work perfectly executing search >> operations. The problem appears when SolrA index new documents (commiting >> itself afther that indexation operation). If I manually execute a commit or >> a softCommit to SolrB or SolrC, they are not able to see the new documents >> added even if it is suposed to reopen a new searcher when a commit occurs. >> >> I have noticed that a commit operation in SolrA shows different segments >> (the newest ones) compared with the logs that SorlB/SolrC has after a >> commit. In other words, SolrA shows newer segments and SolrB/SolrC appears >> to see just the old ones. >> >> Is that normal? Any idea or suggestion to solve this? >> >> Thank you in advance, :-) >> >> Best regards, >> >> -- >> - Luis Cappa >> > > > > -- > - Luis Cappa > -- - Luis Cappa
Spellcheck with Distributed Search (sharding).
Hello! I'be been trying to enable Spellchecking using sharding following the steps from the Wiki, but I failed, :-( What I do is: *Solrconfig.xml* <*searchComponent name="suggest"* class="solr.SpellCheckComponent"> suggest org.apache.solr.spelling.suggest.Suggester org.apache.solr.spelling.suggest.tst.TSTLookup suggestion true <*requestHandler name="/suggest"* class="solr.SearchHandler"> suggestion true suggest 10 suggest *Note:* I have two shards (solr1 and solr2) and both have the same solrconfig.xml. Also, bot indexes were optimized to create the spellchecker indexes. *Query* solr1:8080/events/data/select?q=m&qt=/suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data * * *Response* * * { - responseHeader: { - status: 404, - QTime: 12, - params: { - shards: "solr1:8080/events/data,solr2:8080/events/data", - shards.qt: "/suggestion", - q: "m", - wt: "json", - qt: "/suggestion" } }, - error: { - msg: "Server at http://solr1:8080/events/data returned non ok status:404, message:Not Found", - code: 404 } } More query syntaxes that I used and that doesn't work: http://solr1:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data<http://solrclusterd.buguroo.dev:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data> http://solr1:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data<http://solrclusterd.buguroo.dev:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data> Any idea of what I'm doing wrong? Thank you very much in advance! Best regards, -- - Luis Cappa
Re: Spellcheck with Distributed Search (sharding).
More info: When executing the Query to a single Solr server it works: http://solr1:8080/events/data/suggest?q=m&wt=json<http://solrclusterd.buguroo.dev:8080/events/data/suggest?q=m&wt=json> { - responseHeader: { - status: 0, - QTime: 1 }, - response: { - numFound: 0, - start: 0, - docs: [ ] }, - spellcheck: { - suggestions: [ - "m", - { - numFound: 4, - startOffset: 0, - endOffset: 1, - suggestion: [ - "marca", - "marcacom", - "mis", - "mispelotas" ] } ] } } But when choosing the Request handler this way it doesn't: http://solr1:8080/events/data/select?*qt=/sugges*t&wt=json&q=*:*<http://solrclusterd.buguroo.dev:8080/events/data/select?qt=/suggest&wt=json&q=*:*> 2013/10/23 Luis Cappa Banda > Hello! > > I'be been trying to enable Spellchecking using sharding following the > steps from the Wiki, but I failed, :-( What I do is: > > *Solrconfig.xml* > > > <*searchComponent name="suggest"* class="solr.SpellCheckComponent"> > > suggest > org.apache.solr.spelling.suggest.Suggester > name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup > suggestion > true > > > > > <*requestHandler name="/suggest"* class="solr.SearchHandler"> > > suggestion > true > suggest > 10 > > > suggest > > > > > *Note:* I have two shards (solr1 and solr2) and both have the same > solrconfig.xml. Also, bot indexes were optimized to create the spellchecker > indexes. > > *Query* > > > solr1:8080/events/data/select?q=m&qt=/suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data > > * > * > *Response* > * > * > { > >- responseHeader: >{ > - status: 404, > - QTime: 12, > - params: > { > - shards: "solr1:8080/events/data,solr2:8080/events/data", > - shards.qt: "/suggestion", > - q: "m", > - wt: "json", > - qt: "/suggestion" > } > }, >- error: >{ > - msg: "Server at http://solr1:8080/events/data returned non ok > status:404, message:Not Found", > - code: 404 > } > > } > > More query syntaxes that I used and that doesn't work: > > > http://solr1:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data<http://solrclusterd.buguroo.dev:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data> > > > http://solr1:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data<http://solrclusterd.buguroo.dev:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data> > > > Any idea of what I'm doing wrong? > > Thank you very much in advance! > > Best regards, > > -- > - Luis Cappa > -- - Luis Cappa
Re: Spellcheck with Distributed Search (sharding).
Any idea? 2013/10/23 Luis Cappa Banda > More info: > > When executing the Query to a single Solr server it works: > http://solr1:8080/events/data/suggest?q=m&wt=json<http://solrclusterd.buguroo.dev:8080/events/data/suggest?q=m&wt=json> > > { > >- responseHeader: >{ > - status: 0, > - QTime: 1 > }, >- response: >{ > - numFound: 0, > - start: 0, > - docs: [ ] > }, >- spellcheck: >{ > - suggestions: > [ > - "m", > - > { > - numFound: 4, > - startOffset: 0, > - endOffset: 1, > - suggestion: > [ >- "marca", >- "marcacom", >- "mis", >- "mispelotas" >] > } > ] > } > > } > > > But when choosing the Request handler this way it doesn't: > http://solr1:8080/events/data/select?*qt=/sugges*t&wt=json&q=*:*<http://solrclusterd.buguroo.dev:8080/events/data/select?qt=/suggest&wt=json&q=*:*> > > > > > 2013/10/23 Luis Cappa Banda > >> Hello! >> >> I'be been trying to enable Spellchecking using sharding following the >> steps from the Wiki, but I failed, :-( What I do is: >> >> *Solrconfig.xml* >> >> >> <*searchComponent name="suggest"* class="solr.SpellCheckComponent"> >> >> suggest >> org.apache.solr.spelling.suggest.Suggester >> > name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup >> suggestion >> true >> >> >> >> >> <*requestHandler name="/suggest"* class="solr.SearchHandler"> >> >> suggestion >> true >> suggest >> 10 >> >> >> suggest >> >> >> >> >> *Note:* I have two shards (solr1 and solr2) and both have the same >> solrconfig.xml. Also, bot indexes were optimized to create the spellchecker >> indexes. >> >> *Query* >> >> >> solr1:8080/events/data/select?q=m&qt=/suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data >> >> * >> * >> *Response* >> * >> * >> { >> >>- responseHeader: >>{ >> - status: 404, >> - QTime: 12, >> - params: >> { >> - shards: "solr1:8080/events/data,solr2:8080/events/data", >> - shards.qt: "/suggestion", >> - q: "m", >> - wt: "json", >> - qt: "/suggestion" >> } >> }, >>- error: >>{ >> - msg: "Server at http://solr1:8080/events/data returned non ok >> status:404, message:Not Found", >> - code: 404 >> } >> >> } >> >> More query syntaxes that I used and that doesn't work: >> >> >> http://solr1:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data<http://solrclusterd.buguroo.dev:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data> >> >> >> http://solr1:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data<http://solrclusterd.buguroo.dev:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data> >> >> >> Any idea of what I'm doing wrong? >> >> Thank you very much in advance! >> >> Best regards, >> >> -- >> - Luis Cappa >> > > > > -- > - Luis Cappa > -- - Luis Cappa
Re: Spellcheck with Distributed Search (sharding).
I'ts just a type error, sorry about that! The Request Handler is OK spelled and it doesn't work. 2013/10/24 Dyer, James > Is it that your request handler is named "/suggest" but you are setting > "shards.qt" to "/suggestion" ? > > James Dyer > Ingram Content Group > (615) 213-4311 > > > -Original Message- > From: Luis Cappa Banda [mailto:luisca...@gmail.com] > Sent: Thursday, October 24, 2013 6:22 AM > To: solr-user@lucene.apache.org > Subject: Re: Spellcheck with Distributed Search (sharding). > > Any idea? > > > 2013/10/23 Luis Cappa Banda > > > More info: > > > > When executing the Query to a single Solr server it works: > > http://solr1:8080/events/data/suggest?q=m&wt=json< > http://solrclusterd.buguroo.dev:8080/events/data/suggest?q=m&wt=json> > > > > { > > > >- responseHeader: > >{ > > - status: 0, > > - QTime: 1 > > }, > >- response: > >{ > > - numFound: 0, > > - start: 0, > > - docs: [ ] > > }, > >- spellcheck: > >{ > > - suggestions: > > [ > > - "m", > > - > > { > > - numFound: 4, > > - startOffset: 0, > > - endOffset: 1, > > - suggestion: > > [ > >- "marca", > >- "marcacom", > >- "mis", > > - "mispelotas" > >] > > } > > ] > > } > > > > } > > > > > > But when choosing the Request handler this way it doesn't: > > http://solr1:8080/events/data/select?*qt=/sugges*t&wt=json&q=*:*< > http://solrclusterd.buguroo.dev:8080/events/data/select?qt=/suggest&wt=json&q=*:* > > > > > > > > > > > > 2013/10/23 Luis Cappa Banda > > > >> Hello! > >> > >> I'be been trying to enable Spellchecking using sharding following the > >> steps from the Wiki, but I failed, :-( What I do is: > >> > >> *Solrconfig.xml* > >> > >> > >> <*searchComponent name="suggest"* class="solr.SpellCheckComponent"> > >> > >> suggest > >> org.apache.solr.spelling.suggest.Suggester > >> >> name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup > >> suggestion > >> true > >> > >> > >> > >> > >> <*requestHandler name="/suggest"* class="solr.SearchHandler"> > >> > >> suggestion > >> true > >> suggest > >> 10 > >> > >> > >> suggest > >> > >> > >> > >> > >> *Note:* I have two shards (solr1 and solr2) and both have the same > >> solrconfig.xml. Also, bot indexes were optimized to create the > spellchecker > >> indexes. 
> >> > >> *Query* > >> > >> > >> > solr1:8080/events/data/select?q=m&qt=/suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data > >> > >> * > >> * > >> *Response* > >> * > >> * > >> { > >> > >>- responseHeader: > >>{ > >> - status: 404, > >> - QTime: 12, > >> - params: > >> { > >> - shards: "solr1:8080/events/data,solr2:8080/events/data", > >> - shards.qt: "/suggestion", > >> - q: "m", > >> - wt: "json", > >> - qt: "/suggestion" > >> } > >> }, > >>- error: > >>{ > >> - msg: "Server at http://solr1:8080/events/data returned non ok > >> status:404, message:Not Found", > >> - code: 404 > >> } > >> > >> } > >> > >> More query syntaxes that I used and that doesn't work: > >> > >> > >> > http://solr1:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data > < > http://solrclusterd.buguroo.dev:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data > > > >> > >> > >> > http://solr1:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data > < > http://solrclusterd.buguroo.dev:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solrclusterd.buguroo.dev:8080/events/data,solrclusterc.buguroo.dev:8080/events/data > > > >> > >> > >> Any idea of what I'm doing wrong? > >> > >> Thank you very much in advance! > >> > >> Best regards, > >> > >> -- > >> - Luis Cappa > >> > > > > > > > > -- > > - Luis Cappa > > > > > > -- > - Luis Cappa > > -- - Luis Cappa
Replication: slow first query after replication.
Hi guys! I have a master-slave replication (Solr 4.1 version) with a 30 seconds polling interval and continuously new documents are indexed, so after 30 seconds always new data must be replicated. My test index is not huge: just 5M documents. I have experimented that a simple "q=*:*" query appears to be very slow (up to 10 secs of QTime). After that first slow query the following "q=*:*" queries are much quicker. I feel that warming up caches after replication has something to say about this weird behavior, but maybe an index re-built is also involved. Question time: *1.* How can I warm up caches against? There exists any solrconfig.xml searcher to configure to be executed after replication events? *2. *My system needs to execute queries to the slaves continuously. If there exists any warm up way to reload caches, some queries will experience slow response times until reload has finished, isn't it? *3. *After a replication has done, does Solr execute any index rebuild operation that slow down query responses, or this poor performance is just due to caches? *4. *My system is always querying by the latest documents indexed (I'm filtering by document dates), and I don't use "fq" to execute that queries. In this scenario, do you recommend to disable caches? Thank you very much in advance! Best, -- - Luis Cappa
Re: Replication: slow first query after replication.
Against --> again, :-) 2013/11/5 Luis Cappa Banda > Hi guys! > > I have a master-slave replication (Solr 4.1 version) with a 30 seconds > polling interval and continuously new documents are indexed, so after 30 > seconds always new data must be replicated. My test index is not huge: just > 5M documents. > > I have experimented that a simple "q=*:*" query appears to be very slow > (up to 10 secs of QTime). After that first slow query the following "q=*:*" > queries are much quicker. I feel that warming up caches after replication > has something to say about this weird behavior, but maybe an index re-built > is also involved. > > Question time: > > *1.* How can I warm up caches against? There exists any solrconfig.xml > searcher to configure to be executed after replication events? > > *2. *My system needs to execute queries to the slaves continuously. If > there exists any warm up way to reload caches, some queries will experience > slow response times until reload has finished, isn't it? > > *3. *After a replication has done, does Solr execute any index rebuild > operation that slow down query responses, or this poor performance is just > due to caches? > > *4. *My system is always querying by the latest documents indexed (I'm > filtering by document dates), and I don't use "fq" to execute that queries. > In this scenario, do you recommend to disable caches? > > Thank you very much in advance! > > Best, > > -- > - Luis Cappa > -- - Luis Cappa
Re: Is there any limit how many documents can be indexed by apache solr
Hello! Checkout also your application server logs. Maybe you're trying to index Documents with any syntax error and they are skipped. Regards, - Luis Cappa 2013/11/26 Alejandro Marqués Rodríguez > Hi, > > In lucene you are supossed to be able to index up to 274 billion documents > ( http://lucene.apache.org/core/3_0_3/fileformats.html#Limitations ), so > in > Solr should be something like that. Anyway the maximum number is quite > bigger than those 11.000 ;) > > Could it be that you are reusing IDs so the new documents overwrite the old > ones? > > > 2013/11/26 Kamal Palei > > > Dear All > > I am using Apache solr 3.6.2 with Drupal 7. > > Users keeps adding their profiles (resumes) and with cron task from > Drupal, > > documents get indexed. > > > > Recently I observed, after indexing around 11,000 documents, further > > documents are not getting indexed. > > > > Is there any configuration for max documents those can be indexed. > > > > Kindly help. > > > > Thanks > > kamal > > > > > > -- > Alejandro Marqués Rodríguez > > Paradigma Tecnológico > http://www.paradigmatecnologico.com > Avenida de Europa, 26. Ática 5. 3ª Planta > 28224 Pozuelo de Alarcón > Tel.: 91 352 59 42 > -- - Luis Cappa
Facet count mismatch.
Hello! I've installed a classical two shards Solr 4.5 topology without SolrCloud balancing with an HA proxy. I've got a *copyField* like this: * * Copied from this one: * * * * ** * * * * * * * * * * * * * * ** When faceting with *tagValues* field I've got a total count of 3: - facet_counts: { - facet_queries: { }, - facet_fields: { - tagsValues: [ - "sucks", - 3 ] }, - facet_dates: { }, - facet_ranges: { } } Bug when searching like this with *tagValues* the total number of documents is not three, but two: - params: { - facet: "true", - shards: "solr1.test:8081/comments/data,solr2.test:8080/comments/data", - facet.mincount: "1", - facet.sort: "count", - q: "tagsValues:"sucks"", - facet.limit: "-1", - facet.field: "tagsValues", - wt: "json" } Any idea of what's happening here? I'm confused, :-/ Regards, -- - Luis Cappa
Optimize and replication: some questions battery.
Hello! I've got a scenario where I index very frequently on master servers and replicate to slave servers with one-minute polling. The master indexes are growing fast and I would like to optimize them to improve search queries. However...

1. During an optimize operation, can master servers index new documents? I suppose that is not possible.

2. The optimize operation can take minutes, maybe hours... and that would affect the live/production environment because new documents wouldn't be indexed. Should I optimize each slave's index instead? What will happen with replication? Will slave servers "lose" the index identifiers that allow them to replicate delta documents from the master after optimizing? Will the next replication overwrite the optimized slave indexes?

Thank you very much in advance. Regards,

--
- Luis Cappa
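For reference, an explicit optimize is usually issued against the update handler; a sketch, with host, port and core name as placeholders:

http://master-host:8983/solr/core1/update?optimize=true&maxSegments=1&waitSearcher=false

maxSegments and waitSearcher are optional. As the follow-ups in this thread note, on a frequently updated index it is often better to skip optimize entirely and let the merge policy do the work.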
Re: Optimize and replication: some questions battery.
Hi Chris,

Thank you very much for your response! It was very instructive. I knew about some performance tips to improve search, and I configured a very low merge factor (2) to favor search operations over indexing ones. I don't have deep knowledge of the internal Lucene behavior here, but I thought that an optimize operation might somehow rebuild the index, checking and fixing corrupted segments, merging whatever still needs merging, etc., so that the "new" master index would be a better index to keep inserting data into frequently.

One last question: do you think that this kind of scenario, where I continuously index and replicate data, can corrupt the index? In the past I developed a simple tool using a Lucene class to check the index and alert me if it's corrupted, so if you think this scenario is dangerous maybe I can reuse that tool to prevent weird production situations.

Best,

- Luis Cappa

2014-02-05 Chris Hostetter:
> : I've got a scenario where I index very frequently on master servers and
> : replicate to slave servers with one minute polling. Master indexes are
> : growing fast and I would like to optimize indexes to improve search
> : queries. However...
>
> For a scenario where your index is changing that rapidly, you don't want to use the optimize command at all -- it's not going to improve the performance of anything...
>
> https://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations
>
> You may want to optimize an index in certain situations -- ie: if you build your index once, and then never modify it.
>
> If you have a rapidly changing index, rather than optimizing, you likely simply want to use a lower merge factor. Optimizing is very expensive, and if the index is constantly changing, the slight performance boost will not last long. The tradeoff is often not worth it for a non-static index.
>
> In a master-slave setup, sometimes you may also want to optimize on the master so that slaves serve from a single-segment index. This can greatly increase the time to replicate the index though, so it is often not desirable either.
>
> -Hoss
> http://www.lucidworks.com/

--
- Luis Cappa
Re: Optimize and replication: some questions battery.
Hi Toke! Thanks for answering.

That's it: I mention index corruption only as a precaution, not because I have actually noticed it. In some tests in the past I found that a mergeFactor of 2 improves search speed more than a little compared with common merge factors such as 10. Of course indexing speed is penalized, but my production architecture is based on task queues and workers that index into Solr, and I've developed a custom SolrCluster module that acts as a black box: from the outside it looks like a single Solr server, but internally it balances across N Solr master servers, deciding where to index, checking Solr server status (alive, dead), executing sharded search queries, etc. So that point is under control: if I need more indexing speed I can add new Solr masters and/or new worker modules to dequeue, process and execute index operations. My main worry was squeezing as much search speed as possible out of optimizing, mergeFactor tuning, cache setup, etc.

Thanks a lot!

2014-02-06 Toke Eskildsen:
> On Thu, 2014-02-06 at 10:22 +0100, Luis Cappa Banda wrote:
> > I knew some performance tips to improve search and I configured a very low merge factor (2) to boost search operations instead of indexation ones.
>
> That would give you a small search speed increase and a huge penalty on indexing speed (as it will perform large merges all the time) and replication speed (as all file data will be updated frequently instead of just a subset of them). Unless you are absolutely sure that you need the small search speed increase, you should set this to a higher number.
>
> > I haven't got a deep knowledge of internal Lucene behavior in this case, but I thought that somehow an optimization operation may rebuild the index checking and fixing corrupted segments,
>
> To my knowledge, there are no attempts to repair corrupted segments during merge. I hope you speak of corruption as a precaution and not because it is something that happens to your setup. If you have corrupted indexes at any time, you should investigate how that happens, instead of trying to repair them.
>
> > One last question: do you think that this kind of scenario where I continuously index and replicate data will corrupt the index?
>
> Lucene is used in a lot of places with massive updates. Aside from JVM-related bugs, it has proven to be very stable under these conditions. So no, the indexing will not corrupt anything.
>
> - Toke Eskildsen, State and University Library, Denmark

--
- Luis Cappa
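For reference, the merge factor discussed here lives in the indexConfig section of solrconfig.xml. A minimal Solr 4.x sketch; the value 10 is just the common default, not a recommendation for this particular setup:

<indexConfig>
  <mergeFactor>10</mergeFactor>
  <!-- equivalent, more explicit form using the default TieredMergePolicy:
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicy>
  -->
</indexConfig>

A higher value means fewer, cheaper merges (and smaller replication deltas) at the cost of more segments to search, which is the trade-off Toke describes.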
solr/lucene 4.10 out of memory issues
hey guys,

I'm running a SolrCloud cluster consisting of five nodes. My largest index contains 2.5 million documents and occupies about 6 gigabytes of disk space. We recently switched to the latest Solr version (4.10) from version 4.4.1, which we ran successfully for about a year without any major issues. From the get-go we started having memory problems caused by the CMS old-generation heap usage filling up incrementally. It starts out with very low memory consumption and after 12 hours or so it ends up using all available heap space. We thought it could be one of the caches we had configured, so we reduced our main core's filter cache max size from 1024 to 512 elements. The only thing we accomplished was that the cluster ran for a longer time than before.

I generated several heap dumps and basically what is filling up the heap is Lucene's field cache. It gets bigger and bigger until it fills up all available memory.

My JVM memory settings are the following:

-Xms15g -Xmx15g -XX:PermSize=512m -XX:MaxPermSize=512m -XX:NewSize=5g -XX:MaxNewSize=5g
-XX:+UseParNewGC -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC

What's weird to me is that we didn't have this problem before; I'm thinking this is some kind of memory leak present in the new Lucene. We ran our old cluster for several weeks at a time without having to redeploy because of config changes or other reasons. Was there some issue reported related to elevated memory consumption by the field cache?

any help would be greatly appreciated.

regards,

--
Luis Carlos Guerrero
about.me/luis.guerrero
Re: solr/lucene 4.10 out of memory issues
Thanks for the response, I've been working on solving some of the most evident issues and I also added your garbage collector parameters. First of all the Lucene field cache is being filled with some entries which are marked as 'insanity'. Some of these were related to a custom field that we use for our ranking. We fixed our custom plugin classes so that we wouldn't see any entries related to those fields there, but it seems there are other related problems with the field cache. Mainly the cache is being filled with these types of insanity entries: 'SUBREADER: Found caches for descendants of StandardDirectoryReader' They are all related to standard solr fields. Could it be that our current schemas and configs have some incorrect setting that is not compliant with this lucene version? I'll keep investigating the subject but if there is any additional information you can give me about these types of field cache insanity warnings it would be really helpful. On Thu, Sep 11, 2014 at 3:00 PM, Timothy Potter wrote: > Probably need to look at it running with a profiler to see what's up. > Here's a few additional flags that might help the GC work better for > you (which is not to say there isn't a leak somewhere): > > -XX:MaxTenuringThreshold=8 -XX:CMSInitiatingOccupancyFraction=40 > > This should lead to a nice up-and-down GC profile over time. > > On Thu, Sep 11, 2014 at 10:52 AM, Luis Carlos Guerrero > wrote: > > hey guys, > > > > I'm running a solrcloud cluster consisting of five nodes. My largest > index > > contains 2.5 million documents and occupies about 6 gigabytes of disk > > space. We recently switched to the latest solr version (4.10) from > version > > 4.4.1 which we ran successfully for about a year without any major > issues. > > From the get go we started having memory problems caused by the CMS old > > heap usage being filled up incrementally. It starts out with a very low > > memory consumption and after 12 hours or so it ends up using up all > > available heap space. We thought it could be one of the caches we had > > configured, so we reduced our main core filter cache max size from 1024 > to > > 512 elements. The only thing we accomplished was that the cluster ran > for a > > longer time than before. > > > > I generated several heapdumps and basically what is filling up the heap > is > > lucene's field cache. it gets bigger and bigger until it fills up all > > available memory. > > > > My jvm memory settings are the following: > > > > -Xms15g -Xmx15g -XX:PermSize=512m -XX:MaxPermSize=512m -XX:NewSize=5g > > -XX:MaxNewSize=5g > > -XX:+UseParNewGC -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCDateStamps > > -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError > -XX:+UseConcMarkSweepGC > > What's weird to me is that we didn't have this problem before, I'm > thinking > > this is some kind of memory leak issue present in the new lucene. We ran > > our old cluster for several weeks at a time without having to redeploy > > because of config changes or other reasons. Was there some issue reported > > related to elevated memory consumption by the field cache? > > > > any help would be greatly appreciated. > > > > regards, > > > > -- > > Luis Carlos Guerrero > > about.me/luis.guerrero > -- Luis Carlos Guerrero about.me/luis.guerrero
Re: solr/lucene 4.10 out of memory issues
I checked and these 'insanity' cached keys correspond to fields we use for both grouping and faceting. The same behavior is documented here: https://issues.apache.org/jira/browse/SOLR-4866, although I have single shards for every replica which the jira says is a setup which should not generate these issues. What I don't get is why the cluster was running fine with solr 4.4, although double checking I was using LUCENE_40 as the match version. If I use this match version in my current running 4.10 cluster will it make a difference, or will I experience more issues than if I just roll back to 4.4 with LUCENE_40 match version? The problem in the end is that the fieldcache grows unlimitedly. I'm thinking its because of the insanity entries but I'm not really sure. It seem like a really big problem to leave unattended or is the use case for faceting and grouping on the same field not that common? On Tue, Sep 16, 2014 at 11:06 AM, Luis Carlos Guerrero < lcguerreroc...@gmail.com> wrote: > Thanks for the response, I've been working on solving some of the most > evident issues and I also added your garbage collector parameters. First of > all the Lucene field cache is being filled with some entries which are > marked as 'insanity'. Some of these were related to a custom field that we > use for our ranking. We fixed our custom plugin classes so that we wouldn't > see any entries related to those fields there, but it seems there are other > related problems with the field cache. Mainly the cache is being filled > with these types of insanity entries: > > 'SUBREADER: Found caches for descendants of StandardDirectoryReader' > > They are all related to standard solr fields. Could it be that our current > schemas and configs have some incorrect setting that is not compliant with > this lucene version? I'll keep investigating the subject but if there is > any additional information you can give me about these types of field cache > insanity warnings it would be really helpful. > > On Thu, Sep 11, 2014 at 3:00 PM, Timothy Potter > wrote: > >> Probably need to look at it running with a profiler to see what's up. >> Here's a few additional flags that might help the GC work better for >> you (which is not to say there isn't a leak somewhere): >> >> -XX:MaxTenuringThreshold=8 -XX:CMSInitiatingOccupancyFraction=40 >> >> This should lead to a nice up-and-down GC profile over time. >> >> On Thu, Sep 11, 2014 at 10:52 AM, Luis Carlos Guerrero >> wrote: >> > hey guys, >> > >> > I'm running a solrcloud cluster consisting of five nodes. My largest >> index >> > contains 2.5 million documents and occupies about 6 gigabytes of disk >> > space. We recently switched to the latest solr version (4.10) from >> version >> > 4.4.1 which we ran successfully for about a year without any major >> issues. >> > From the get go we started having memory problems caused by the CMS old >> > heap usage being filled up incrementally. It starts out with a very low >> > memory consumption and after 12 hours or so it ends up using up all >> > available heap space. We thought it could be one of the caches we had >> > configured, so we reduced our main core filter cache max size from 1024 >> to >> > 512 elements. The only thing we accomplished was that the cluster ran >> for a >> > longer time than before. >> > >> > I generated several heapdumps and basically what is filling up the heap >> is >> > lucene's field cache. it gets bigger and bigger until it fills up all >> > available memory. 
>> > >> > My jvm memory settings are the following: >> > >> > -Xms15g -Xmx15g -XX:PermSize=512m -XX:MaxPermSize=512m -XX:NewSize=5g >> > -XX:MaxNewSize=5g >> > -XX:+UseParNewGC -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCDateStamps >> > -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError >> -XX:+UseConcMarkSweepGC >> > What's weird to me is that we didn't have this problem before, I'm >> thinking >> > this is some kind of memory leak issue present in the new lucene. We ran >> > our old cluster for several weeks at a time without having to redeploy >> > because of config changes or other reasons. Was there some issue >> reported >> > related to elevated memory consumption by the field cache? >> > >> > any help would be greatly appreciated. >> > >> > regards, >> > >> > -- >> > Luis Carlos Guerrero >> > about.me/luis.guerrero >> > > > > -- > Luis Carlos Guerrero > about.me/luis.guerrero > -- Luis Carlos Guerrero about.me/luis.guerrero
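Since the insanity entries come from faceting and grouping on the same fields, one commonly suggested mitigation (a sketch only; the field name is a placeholder and a full reindex is required after the change) is to declare those fields with docValues in schema.xml, so they are served from DocValues structures instead of the uninverted Lucene FieldCache:

<field name="category" type="string" indexed="true" stored="false" docValues="true"/>

In Solr 4.10, faceting and grouping on a docValues-backed string field should avoid populating the FieldCache/UnInvertedField entries that show up in those heap dumps, at the cost of some extra index size.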
Syllabification, readability metric
Hi, Does Lucene support syllabification of words out of the box? If so, is there support for Brazilian Portuguese? I'm trying to set up a readability score for short text descriptions and this would be really helpful. thanks, -- Luis Carlos Guerrero about.me/luis.guerrero
Delete in Solr based on foreign key (like SQL delete from … where id in (select id from…)
Given the following Solr data (the XML field markup was stripped from the archived message, leaving only the raw field values):

1008rs1cz0icl2pk 2014-10-07T14:18:29.784Z h60fmtybz0i7sx87 1481314421768716288
u42xyz1cz0i7sx87 h60fmtybz0i7sx87 1481314421768716288
u42xyz1cz0i7sx87 h60fmtybz0i7sx87 1481314421448900608

I would like to know how to *DELETE documents* like the above on the Solr console, or using a script, achieving the same result as issuing the following statement in SQL (assuming all of these columns existed in a table called x):

DELETE FROM x WHERE foreign_key_docid_s in (select docid_s from x where message_state_ts < '2014-10-05' and message_state_ts > '2014-10-01')

Basically, delete all derived documents whose foreign key is the same as the primary key, where the primary key is selected between 2 dates.

Question originally posted on stackoverflow.com -> http://stackoverflow.com/questions/26248372/delete-in-solr-based-on-foreign-key-like-sql-delete-from-where-id-in-selec
Re: Delete in Solr based on foreign key (like SQL delete from … where id in (select id from…)
Hi matthew, I'm more than glad getting the ids and deleting them in a separate query, if need be. But how do I do it? It's dozens of thousands of ids that I have to delete. What's the strategy to delete them? On Fri, Oct 10, 2014 at 4:16 AM, Matthew Nigl wrote: > I was going to say that the below should do what you are asking: > > {!join from=docid_s > to=foreign_key_docid_s}(message_state_ts:[* TO 2014-10-05T00:00:00Z} AND > message_state_ts:{2014-10-01T00:00:00Z TO *]) > > But I get the same response as in > https://issues.apache.org/jira/browse/SOLR-6357 > > I can't think of any other queries at the moment. You might consider using > the above query (which should work as a normal select query) to get the > IDs, then delete them in a separate query. > > > On 10 October 2014 07:31, Luis Festas Matos wrote: > > > Given the following Solr data: > > > > > > 1008rs1cz0icl2pk > > 2014-10-07T14:18:29.784Z > > h60fmtybz0i7sx87 > > 1481314421768716288 > > u42xyz1cz0i7sx87 > > h60fmtybz0i7sx87 > > 1481314421768716288 > >u42xyz1cz0i7sx87 > >h60fmtybz0i7sx87 > >1481314421448900608 > > > > I would like to know how to *DELETE documents* above on the Solr console > or > > using a script that achieves the same result as issuing the following > > statement in SQL (assuming all of these columns existed in a table > called x > > ): > > > > DELETE FROM x WHERE foreign_key_docid_s in (select docid_s from x > > where message_state_ts < '2014-10-05' and message_state_ts > > > '2014-10-01') > > > > Basically, delete all derived documents whose foreign key is the same as > > the primary key where the primary key is selected between 2 dates. > > > > Question originally posted on stackoverflow.com -> > > > > > http://stackoverflow.com/questions/26248372/delete-in-solr-based-on-foreign-key-like-sql-delete-from-where-id-in-selec > > >
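A common pattern for this is to page through the matching documents with a normal select and then delete the collected ids in batches. A sketch, assuming SolrJ 4.x; the join query is the one Matthew suggested, the collection URL is a placeholder, and "id" stands in for whatever the real uniqueKey field is:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class DeleteByForeignKey {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Matthew's join query, used as a plain select to collect the ids to delete.
        SolrQuery query = new SolrQuery("{!join from=docid_s to=foreign_key_docid_s}"
                + "message_state_ts:[* TO 2014-10-05T00:00:00Z} AND message_state_ts:{2014-10-01T00:00:00Z TO *]");
        query.setFields("id"); // placeholder: use the real uniqueKey field name
        query.setRows(1000);

        // 1. Page through the results and collect the ids.
        List<String> ids = new ArrayList<String>();
        int start = 0;
        while (true) {
            query.setStart(start);
            SolrDocumentList page = solr.query(query).getResults();
            if (page.isEmpty()) {
                break;
            }
            for (SolrDocument doc : page) {
                ids.add((String) doc.getFieldValue("id"));
            }
            start += page.size();
        }

        // 2. Delete them in batches of 1000, then commit once at the end.
        for (int i = 0; i < ids.size(); i += 1000) {
            solr.deleteById(ids.subList(i, Math.min(i + 1000, ids.size())));
        }
        solr.commit();
        solr.shutdown();
    }
}

Tens of thousands of ids fit comfortably in memory, which is why the sketch collects first and deletes afterwards instead of deleting while paging.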
Email regular expression.
Hello everyone!

Unfortunately I have to find all e-mail addresses that appear in a text field of each document. I've been reading for a while about how to use regexps in Solr, but after trying some of them they didn't work. I've noticed that the Lucene regexp syntax is sometimes very different from classic regexp syntax, so that may be the reason they didn't work for me, and maybe someone more expert can help me.

The query I'm trying is the following:

text:/[a-z0-9_\|-]+(\.[a-z0-9_\|-]|)*@[a-z0-9-]|(\.[a-z0-9-]|)*\.([a-z]{2,4})/

Thank you very much in advance! Best regards,

--
- Luis Cappa
Re: Email regular expression.
Hello, Jack, Steve,

Thank you for your answers. I've never used UAX29URLEmailTokenizerFactory, but I read about it before trying regexp queries. As far as I know, UAX29URLEmailTokenizerFactory tokenizes the input text and recognizes patterns that match URLs, e-mails, etc. Reading the documentation I haven't found any way to keep just the e-mail patterns and not the URL ones, for example. I feel it would make sense to specify one or more patterns in a configuration file set on the tokenizer definition in schema.xml, but I found nothing.

I just want to retrieve those indexed documents where at least one e-mail appears inside the text. However, even using UAX29URLEmailTokenizerFactory, and supposing that I store that e-mail data in a field called 'emails' (I feel creative, hehe), a query like the following appears to be... dirty:

http://localhost:8080/mysolr/select?q=emails:[* TO *]&start=0&rows=10&sort=mydate desc

What do you think?

And Andy... I know many regexps to find e-mail patterns in a text - that wasn't my question, and of course there is no perfect one. However, Lucene regexp syntax is different from the classic regexp one, so it's not as easy as copy & paste any regexp and, voilà! E-mails everywhere.

Thank you very much in advance, Best regards,

2013/7/30 Jack Krupansky
> Just use the UAX29URLEmailTokenizerFactory, which recognizes email addresses.
>
> Any particular reason that you're trying to reinvent the wheel?
>
> -- Jack Krupansky
>
> -Original Message- From: Luis Cappa Banda
> Sent: Tuesday, July 30, 2013 10:53 AM
> To: solr-user@lucene.apache.org
> Subject: Email regular expression.
>
> Hello everyone!
>
> Unfortunately I have to search all E-mail addresses found in a text field from each document. I've been reading for a while how to use RegExp's in Solr, but after trying some of them they didn't work. I've noticed that Lucene RegExp syntax sometimes is very different from the classic RegExp syntax, so that may be the reason why they didn't work for me, and maybe someone more expert can help me.
>
> The syntax is the following:
>
> E-mail:
>
> text:/[a-z0-9_\|-]+(\.[a-z0-9_\|-]|)*@[a-z0-9-]|(\.[a-z0-9-]|)*\.([a-z]{2,4})/
>
> Thank you very much in advance!
>
> Best regards,
>
> --
> - Luis Cappa
Re: Email regular expression.
Hello guys, Hey, I think I´ve found how to do this just adding a filter. Just for anyone´s curiosity: Anyway, I still need to do a query like the following to retrieve those documents with at least one E-mail detected: http://localhost:8080/mysolr/select?q=emails:[* TO *]&start=0&rows=10&sort=mydate desc And I don´t like it, to be honest, Regards, 2013/7/30 Luis Cappa Banda > Hello, Jack, Steve, > > Thank you for your answers. I´ve never used UAX29URLEmailTokenizerFactory, > but I´ve read about it before trying RegExp´s queries. As far as I know, > UAX29URLEmailTokenizerFactory > allows to tokenize an entry text value into patterns that match URLs, > E-mails, etc. Reading the documentation I haven´t found any way to select > just E-mail patterns, not URL ones, for example. I feel that it may have > sense to specify one or multiple patterns in a configuration file to be > setted during the Tokenizer definition in the schema.xml, but I found > nothing. > > I´ve just want to retrieve those documents indexed where they appear at > least one E-mail inside de text. However, even using > UAX29URLEmailTokenizerFactory, > and suposing that I store that E-mail data in a field called 'emails' (I > feel creative, hehe), a query like the following appears to be... dirty: > > http://localhost:8080/mysolr/select?q=emails:[* TO > *]&start=0&rows=10&sort=mydate desc > > What do you think about? > > And Andy... I know many RegExps to find E-mail patterns in a text - that > wasn´t my question, and of course there is no perfect one. However, Lucene > RegExp syntax is different from classic RegExp one, so is not as easy as > copy & paste any RegExps and, voilá! E-mails everywhere. > > Thank you very much in advance, > > Best regards, > > > > > > 2013/7/30 Jack Krupansky > >> Just use the UAX29URLEmailTokenizerFactory, which recognizes email >> addresses. >> >> Any particular reason that you're trying to reinvent the wheel? >> >> -- Jack Krupansky >> >> -Original Message- From: Luis Cappa Banda >> Sent: Tuesday, July 30, 2013 10:53 AM >> To: solr-user@lucene.apache.org >> Subject: Email regular expression. >> >> >> Hello everyone! >> >> Unfortunately I have to search all E-mail addresses found in a text field >> from each document. I've been reading for a while how to use RegExp's in >> Solr, but after trying some of them they didn't work. I've noticed that >> Lucene RegExp syntax sometimes is very different from the classic RegExp >> syntax, so that may be the reason why they didn't work for me, and maybe >> someone more expert can help me. >> >> The syntax is the following: >> >> *E-mail: * >> >> text:/[a-z0-9_\|-]+(\.[a-z0-9_**\|-]|)*@[a-z0-9-]|(\.[a-z0-9-]** >> |)*\.([a-z]{2,4})/ >> >> Thank you very much in advance! >> >> Best regards, >> >> -- >> - Luis Cappa >> > > > > -- > - Luis Cappa > -- - Luis Cappa
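The filter configuration in the message above was stripped by the archive. A comparable setup (a sketch only, not necessarily what was actually used) keeps only the <EMAIL> tokens emitted by the tokenizer, using a whitelist TypeTokenFilter:

<fieldType name="emails_only" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <!-- email_types.txt lives in the core's conf directory and contains a single line: <EMAIL> -->
    <filter class="solr.TypeTokenFilterFactory" types="email_types.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="emails" type="emails_only" indexed="true" stored="false" multiValued="true"/>
<copyField source="text" dest="emails"/>

With something like this, the 'emails' field ends up containing only e-mail tokens, which answers the earlier question about selecting e-mail patterns but not URL ones.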
Re: Email regular expression.
I've tried this kind of query in the past and found the performance poor - incredibly slow, in fact. But that's just my experience; maybe someone can share another opinion with us.

2013/7/30 Raymond Wiker
> On Jul 30, 2013, at 22:05, Luis Cappa Banda wrote:
> > Anyway, I still need to do a query like the following to retrieve those documents with at least one E-mail detected:
> >
> > http://localhost:8080/mysolr/select?q=emails:[* TO *]&start=0&rows=10&sort=mydate desc
>
> Can't you just use emails:* ?

--
- Luis Cappa
Re: Email regular expression.
I've been re-reading older solr-user mailing list messages about this, and it seems that a query like 'field:*' means that internally all the indexed terms are checked one by one, even if some caches are filled for that field. That explains my poor performance in the past.

However, it may be possible to create a field called 'flagEmails' that is true whenever the 'emails' field gets populated via UAX29URLEmailTokenizerFactory. Has anyone implemented this kind of behavior at index time? Is it possible?

Regards,

2013/7/30 Luis Cappa Banda
> I've tried this kind of query in the past and found the performance poor - incredibly slow, in fact. But that's just my experience; maybe someone can share another opinion with us.
>
> 2013/7/30 Raymond Wiker
>> On Jul 30, 2013, at 22:05, Luis Cappa Banda wrote:
>> > Anyway, I still need to do a query like the following to retrieve those documents with at least one E-mail detected:
>> >
>> > http://localhost:8080/mysolr/select?q=emails:[* TO *]&start=0&rows=10&sort=mydate desc
>>
>> Can't you just use emails:* ?
>
> --
> - Luis Cappa

--
- Luis Cappa
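One way to get such an index-time flag is a small update-processor script. This is a sketch only: it assumes Solr 4.x with StatelessScriptUpdateProcessorFactory available, the regexp is deliberately simplistic, the field names 'text' and 'flagEmails' are the ones discussed above, and the chain/script names are invented:

<!-- solrconfig.xml: an update chain that runs a script before the normal update -->
<updateRequestProcessorChain name="flag-emails">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">flag-emails.js</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

// flag-emails.js (in the core's conf directory): set flagEmails when the raw text looks like it contains an e-mail
function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var text = doc.getFieldValue("text");
  if (text != null && /\S+@\S+\.\S+/.test(String(text))) {
    doc.setField("flagEmails", true);
  }
}
function processDelete(cmd) { }
function processCommit(cmd) { }
function finish() { }

The chain also needs a boolean 'flagEmails' field in schema.xml and must be selected by the update handler (e.g. update.chain=flag-emails). Note the flag only says the raw text looked like it contained an e-mail; it is not derived from the tokenizer output.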
Re: Performance question on Spatial Search
Hey, David, I´ve been reading the thread and I think that is one of the most educative mail-threads I´ve read in Solr mailing list. Just for curiosity: internally for Solr, is it the same a query like "field:*" and "field:[* TO *]"? I think that it´s expected to receive the same number of numFound documents, but I would like to know the internal behavior of Solr. Best regards, - Luis Cappa 2013/7/30 Smiley, David W. > Steve, > The FieldCache and DocValues are irrelevant to this problem. Solr's > FilterCache is, and Lucene has no counterpart. Perhaps it would be cool > if Solr could look for expensive field:* usages when parsing its queries > and re-write them to use the FilterCache. That's quite doable, I think. > I just created an issue for it: > https://issues.apache.org/jira/browse/SOLR-5093but don't expect me to > work on it anytime soon ;-) > > > ~ David > > On 7/30/13 2:02 PM, "Steven Bower" wrote: > > >I am curious why the field:* walks the entire terms list.. could this be > >discovered from a field cache / docvalues? > > > >steve > > > > > >On Tue, Jul 30, 2013 at 2:00 PM, Steven Bower wrote: > > > >> Until I get the data refed I there was another field (a date field) that > >> was there and not when the geo field was/was not... i tried that field:* > >> and query times come down to 2.5s .. also just removing that filter > >>brings > >> the query down to 30ms.. so I'm very hopeful that with just a boolean > >>i'll > >> be down in that sub 100ms range.. > >> > >> steve > >> > >> > >> On Tue, Jul 30, 2013 at 12:02 PM, Steven Bower > >>wrote: > >> > >>> Will give the boolean thing a shot... makes sense... > >>> > >>> > >>> On Tue, Jul 30, 2013 at 11:53 AM, Smiley, David W. > >>>wrote: > >>> > >>>> I see the problem ‹ it's +pp:*. It may look innocent but it's a > >>>> performance killer. What your telling Lucene to do is iterate over > >>>> *every* term in this index to find all documents that have this data. > >>>> Most fields are pretty slow to do that. Lucene/Solr does not have > >>>>some > >>>> kind of cache for this. Instead, you should index a new boolean field > >>>> indicating wether or not 'pp' is populated and then do a simple true > >>>> check > >>>> against that field. Another approach you could do right now without > >>>> reindexing is to simplify the last 2 clauses of your 3-clause boolean > >>>> query by using the "IsDisjointTo" predicate. But unfortunately Lucene > >>>> doesn't have a generic filter cache capability and so this predicate > >>>>has > >>>> no place to cache the whole-world query it does internally (each and > >>>> every > >>>> time it's used), so it will be slower than the boolean field I > >>>>suggested > >>>> you add. > >>>> > >>>> > >>>> Nevermind on LatLonType; it doesn't support JTS/Polygons. There is > >>>> something close called SpatialPointVectorFieldType that could be > >>>>modified > >>>> trivially but it doesn't support it now. 
> >>>> > >>>> ~ David > >>>> > >>>> On 7/30/13 11:32 AM, "Steven Bower" wrote: > >>>> > >>>> >#1 Here is my query: > >>>> > > >>>> >sort=vid asc > >>>> >start=0 > >>>> >rows=1000 > >>>> >defType=edismax > >>>> >q=*:* > >>>> >fq=recordType:"xxx" > >>>> >fq=vt:"X12B" AND > >>>> >fq=(cls:"3" OR cls:"8") > >>>> >fq=dt:[2013-05-08T00:00:00.00Z TO 2013-07-08T00:00:00.00Z] > >>>> >fq=(vid:86XXX73 OR vid:86XXX20 OR vid:89XXX60 OR vid:89XXX72 OR > >>>> >vid:89XXX48 > >>>> >OR vid:89XXX31 OR vid:89XXX28 OR vid:89XXX67 OR vid:90XXX76 OR > >>>> vid:90XXX33 > >>>> >OR vid:90XXX47 OR vid:90XXX97 OR vid:90XXX69 OR vid:90XXX31 OR > >>>> vid:90XXX44 > >>>> >OR vid:91XXX82 OR vid:91XXX08 OR vid:91XXX32 OR vid:91XXX13 OR > >>>> vid:91XXX87 > >>>> >OR vid:91XXX82 OR vid:91XXX48 OR vid:91XXX34 OR vid:91XXX31 OR > >>>
Re: Performance question on Spatial Search
Thank you very much, David. That was a great explanation! Regards, - Luis Cappa 2013/7/30 Smiley, David W. > Luis, > > field:* and field:[* TO *] are semantically equivalent -- they have the > same effect. But they internally work differently depending on the field > type. The field type has the chance to intercept the range query to do > something smart (FieldType.getRangeQuery(...)). Numeric/Date (trie) > fields have a reasonably quick implementation for such queries. Spatial > fields could be enhanced similarly but aren't (yet). So in general you > should avoid field:* in favor of field:[* TO *]. Perhaps Solr should > redirect a field:* to the FieldType's getRangeQuery method so that there > is no difference. Anyway, the official/best way to ask for all data in a > field (without cheating and indexing a boolean in a different field) is > field:[* TO *]. > > ~ David > > On 7/30/13 4:44 PM, "Luis Cappa Banda" wrote: > > >Hey, David, > > > >I´ve been reading the thread and I think that is one of the most educative > >mail-threads I´ve read in Solr mailing list. Just for curiosity: > >internally > >for Solr, is it the same a query like "field:*" and "field:[* TO *]"? I > >think that it´s expected to receive the same number of numFound documents, > >but I would like to know the internal behavior of Solr. > > > >Best regards, > > > >- Luis Cappa > > > > > >2013/7/30 Smiley, David W. > > > >> Steve, > >> The FieldCache and DocValues are irrelevant to this problem. Solr's > >> FilterCache is, and Lucene has no counterpart. Perhaps it would be cool > >> if Solr could look for expensive field:* usages when parsing its queries > >> and re-write them to use the FilterCache. That's quite doable, I think. > >> I just created an issue for it: > >> https://issues.apache.org/jira/browse/SOLR-5093but don't expect me > >>to > >> work on it anytime soon ;-) > >> > >> > >> ~ David > >> > >> On 7/30/13 2:02 PM, "Steven Bower" wrote: > >> > >> >I am curious why the field:* walks the entire terms list.. could this > >>be > >> >discovered from a field cache / docvalues? > >> > > >> >steve > >> > > >> > > >> >On Tue, Jul 30, 2013 at 2:00 PM, Steven Bower > >>wrote: > >> > > >> >> Until I get the data refed I there was another field (a date field) > >>that > >> >> was there and not when the geo field was/was not... i tried that > >>field:* > >> >> and query times come down to 2.5s .. also just removing that filter > >> >>brings > >> >> the query down to 30ms.. so I'm very hopeful that with just a boolean > >> >>i'll > >> >> be down in that sub 100ms range.. > >> >> > >> >> steve > >> >> > >> >> > >> >> On Tue, Jul 30, 2013 at 12:02 PM, Steven Bower > >> >>wrote: > >> >> > >> >>> Will give the boolean thing a shot... makes sense... > >> >>> > >> >>> > >> >>> On Tue, Jul 30, 2013 at 11:53 AM, Smiley, David W. > >> >>>wrote: > >> >>> > >> >>>> I see the problem ‹ it's +pp:*. It may look innocent but it's a > >> >>>> performance killer. What your telling Lucene to do is iterate over > >> >>>> *every* term in this index to find all documents that have this > >>data. > >> >>>> Most fields are pretty slow to do that. Lucene/Solr does not have > >> >>>>some > >> >>>> kind of cache for this. Instead, you should index a new boolean > >>field > >> >>>> indicating wether or not 'pp' is populated and then do a simple > >>true > >> >>>> check > >> >>>> against that field. 
Another approach you could do right now > >>without > >> >>>> reindexing is to simplify the last 2 clauses of your 3-clause > >>boolean > >> >>>> query by using the "IsDisjointTo" predicate. But unfortunately > >>Lucene > >> >>>> doesn't have a generic filter cache capability and so this > >>predicate > >> >>>>has > >> >>>> no place to cache the whole-world query it
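To make the upshot of this thread concrete (a sketch; 'pp' is the field from the thread and 'has_pp' is an invented name for the suggested boolean flag):

q=pp:*            walks every indexed term in the field; avoid it
q=pp:[* TO *]     same result; lets the field type supply a smarter range implementation (numeric/date fields do, spatial does not yet)
fq=has_pp:true    cheapest option: an explicitly indexed boolean flag, cacheable in the filterCache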
EmbeddedSolrServer Solr 4.4.0 bug?
Hello guys,

Since I upgraded from 4.1.0 to 4.4.0 I've noticed that the way EmbeddedSolrServer is constructed has changed a little:

Solr 4.1.0 style:

CoreContainer coreContainer = new CoreContainer(solrHome, new File(solrHome + "/solr.xml"));
EmbeddedSolrServer localSolrServer = new EmbeddedSolrServer(coreContainer, core);

Solr 4.4.0 new style:

CoreContainer coreContainer = new CoreContainer(solrHome);
EmbeddedSolrServer localSolrServer = new EmbeddedSolrServer(coreContainer, core);

However, it's not working. I've got a solr.xml configuration file declaring the core (its XML was stripped from the archived message), and the resources appear to be loaded correctly:

2013-07-31 09:46:37,583 47889 [main] INFO org.apache.solr.core.ConfigSolr - Loading container configuration from /opt/solr/solr.xml

But when indexing into the core with coreName 'core', it throws an exception:

2013-07-31 09:50:49,409 5189 [main] ERROR com.buguroo.solr.index.WriteIndex - No such core: core

Either I am sleepy, which is possible, or there is some kind of bug here.

Best regards,

--
- Luis Cappa
Re: EmbeddedSolrServer Solr 4.4.0 bug?
Thank you very much, Alan. Now it's working! I agree with you: this kind of things should be documented at least in CHANGELOG.txt, because when upgrading from one version to another all should be compatible between versions, but this is not the case, thus people should be noticed of that. Regards, 2013/7/31 Alan Woodward > Hi Luis, > > You need to call coreContainer.load() after construction for it to load > the cores. Previously the CoreContainer(solrHome, configFile) constructor > also called load(), but this was the only constructor to do that. > > I probably need to put something in CHANGES.txt to point this out... > > Alan Woodward > www.flax.co.uk > > > On 31 Jul 2013, at 08:53, Luis Cappa Banda wrote: > > > Hello guys, > > > > Since I upgrade from 4.1.0 to 4.4.0 version I've noticed that > > EmbeddedSolrServer has changed a little the way of construction: > > > > *Solr 4.1.0 style:* > > > > CoreContainer coreContainer = new CoreContainer(*solrHome, new > > File(solrHome+"/solr.xml"*)); > > EmbeddedSolrServer localSolrServer = new > EmbeddedSolrServer(coreContainer, > > core); > > > > *Solr 4.4.0 new style: > > * > > > > CoreContainer coreContainer = new CoreContainer(*solrHome*); > > EmbeddedSolrServer localSolrServer = new > EmbeddedSolrServer(coreContainer, > > core); > > > > > > However, it's not working. I've got the following solr.xml configuration > > file: > > > > * > hostPort="${jetty.port:8983}" hostContext="${hostContext:solr}" > > zkClientTimeout="${zkClientTimeout:15000}"> > > * > > ** > > ** > > ** > > > > > > And resources appears to be loaded correctly: > > > > *2013-07-31 09:46:37,583 47889 [main] INFO > org.apache.solr.core.ConfigSolr > > - Loading container configuration from /opt/solr/solr.xml* > > > > > > But when indexing into core with coreName 'core', it throws an Exception: > > > > *2013-07-31 09:50:49,409 5189 [main] ERROR > > com.buguroo.solr.index.WriteIndex - No such core: core* > > > > Or I am sleppy, something that's possible, or there is some kind of bug > > here. > > > > Best regards, > > > > -- > > - Luis Cappa > > -- - Luis Cappa
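For anyone landing on this thread, a minimal sketch of the 4.4-style construction with the extra load() call that Alan points out; the solr home path and core name are placeholders:

import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.core.CoreContainer;

public class EmbeddedExample {
    public static void main(String[] args) throws Exception {
        CoreContainer coreContainer = new CoreContainer("/opt/solr");
        coreContainer.load();   // required in 4.4+: the constructor no longer loads the cores

        EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, "core");
        try {
            server.ping();      // index or query as usual from here
        } finally {
            server.shutdown();  // also shuts down the CoreContainer
        }
    }
}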
Re: Distributed MLT is slow
Is distributed MLT officially released or you are using a patch? El martes, 20 de agosto de 2013, Shawn Heisey escribió: > Before I file an issue on this, I wanted to bring it up here, so I can see > if there's something I'm overlooking. > > Distributed MLT is very very slow for me. I can make it work, but a QTime > of one to two minutes in production isn't acceptable. Sending a > non-distributed MLT request directly to a large shard takes about 1.5 > seconds. There are six large cold shards and one tiny hot shard. > > I used my dev server to gather some logs. This server is considerably > less powerful than my production servers, but has exactly the same data. > It's running a 4.5 snapshot with the patch from SOLR-5125. Unlike my > production servers, the dev server takes over four minutes for the > distributed MLT request. Slightly redacted logfile at this URL: > > https://dl.dropboxusercontent.**com/u/97770508/slow-mlt.log<https://dl.dropboxusercontent.com/u/97770508/slow-mlt.log> > > After I ran the query that you can see in the logfile, I restarted Solr on > my dev server and ran one of the slow subrequests directly to a shard. > Here's the debugQuery timing section from that request. QTime on it was > 56506: > > "QParser":"LuceneQParser", > "timing":{ > "time":56504.0, > "prepare":{ > "time":29.0, > "query":{ > "time":29.0}, > "facet":{ > "time":0.0}, > "mlt":{ > "time":0.0}, > "highlight":{ > "time":0.0}, > "stats":{ > "time":0.0}, > "spellcheck":{ > "time":0.0}, > "debug":{ > "time":0.0}}, > "process":{ > "time":56475.0, > "query":{ > "time":935.0}, > "facet":{ > "time":0.0}, > "mlt":{ > "time":55442.0}, > "highlight":{ > "time":0.0}, > "stats":{ > "time":0.0}, > "spellcheck":{ > "time":0.0}, > "debug":{ > "time":98.0} > > Is there anything for me to do other than file an issue? > > Thanks, > Shawn > -- - Luis Cappa
Re: SOLR Prevent solr of modifying fields when update doc
Hi, The uuid, that was been used like the id of a document, it's generated by solr using an updatechain. I just use the recommend method to generate uuid's. I think an atomic update is not suitable for me, because I want that solr indexes the feeds and not me. I don't want to send information to solr, I want that indexes it each 15 minutes, for example, and now it's doing that. Lance, I don't understand what you want to say with, software that I use to index. I just use solr. I have a configuration with two entities. One that selects my rss sources from a database and then the main entity that get information from an URL and processes it. Thank you all for the answers. Much appreciated On Saturday, August 24, 2013, Greg Preston wrote: > But there is an API for sending a delta over the wire, and server side it > does a read, overlay, delete, and insert. And only the fields you sent > will be changed. > > *Might require your unchanged fields to all be stored, though. > > > -Greg > > > On Fri, Aug 23, 2013 at 7:08 PM, Lance Norskog > > > wrote: > > > Solr does not by default generate unique IDs. It uses what you give as > > your unique field, usually called 'id'. > > > > What software do you use to index data from your RSS feeds? Maybe that is > > creating a new 'id' field? > > > > There is no partial update, Solr (Lucene) always rewrites the complete > > document. > > > > > > On 08/23/2013 09:03 AM, Greg Preston wrote: > > > >> Perhaps an atomic update that only changes the fields you want to > change? > >> > >> -Greg > >> > >> > >> On Fri, Aug 23, 2013 at 4:16 AM, Luís Portela Afonso > >> > wrote: > >> > >>> Hi thanks by the answer, but the uniqueId is generated by me. But when > >>> solr indexes and there is an update in a doc, it deletes the doc and > >>> creates a new one, so it generates a new UUID. > >>> It is not suitable for me, because i want that solr just updates some > >>> fields, because the UUID is the key that i use to map it to an user in > my > >>> database. > >>> > >>> Right now i'm using information that comes from the source and never > >>> chages, as my uniqueId, like for example the guid, that exists in some > rss > >>> feeds, or if it doesn't exists i use link. > >>> > >>> I think there is any simple solution for me, because for what i have > >>> read, when an update to a doc exists, SOLR deletes the old one and > create a > >>> new one, right? > >>> > >>> On Aug 23, 2013, at 12:07 PM, Erick Erickson > >>> > > > >>> wrote: > >>> > >>> Well, not much in the way of help because you can't do what you > want AFAIK. I don't think UUID is suitable for your use-case. Why not > use your ? > > Or generate something yourself... > > Best > Erick > > > On Thu, Aug 22, 2013 at 5:56 PM, Luís Portela Afonso < > meligalet...@gmail.com > > > wrote: > > Hi, > > > > How can i prevent solr from update some fields when updating a doc? > > The problem is, i have an uuid with the field name uuid, but it is > not > > an > > unique key. When a rss source updates a feed, solr will update the > doc > > with > > the same link but it generates a new uuid. This is not the desired > > because > > this id is used by me to relate feeds with an user. > > > > Can someone help me? > > > > Many Thanks > > > > > > -- Sent from Gmail Mobile
Re: SOLR Prevent solr of modifying fields when update doc
Hi, right now I'm using the link field that comes in any rss entry as my uniqueKey. That was the best solution that I found because in many updated documents, this was the only field that never changes. Now I'm facing another problem. When I want to search for a document with that id or link, because that is my uniqueKey, I'm not able to get an unique result. I can't successfully search for a field that is a URL on solr. I think that is because I'm encoding the URL that I'm searching for, but solr doesn't decodes it. Thanks for the concern and help On Saturday, August 24, 2013, Erick Erickson wrote: > bq: but the uniqueId is generated by me. But when solr indexes and there > is an update in a doc, it deletes the doc and creates a new one, so it > generates a new UUID. > > right, this is why I was saying that a UUID field may not fit your use > case. The _point_ of a UUID field is to generate a unique entry for every > added document, there's no concept of "only generate the UUID once per > indexed" which seems to be what you want. > > So I'd do something like just use the field rather than a > separate UUID field. That doesn't change by definition. What advantage do > you think you get from the UUID field over just using your > field? > > Best, > Erick > > > On Sat, Aug 24, 2013 at 6:26 AM, Luis Portela Afonso < > meligalet...@gmail.com > > wrote: > > > Hi, > > > > The uuid, that was been used like the id of a document, it's generated by > > solr using an updatechain. > > I just use the recommend method to generate uuid's. > > > > I think an atomic update is not suitable for me, because I want that solr > > indexes the feeds and not me. I don't want to send information to solr, I > > want that indexes it each 15 minutes, for example, and now it's doing > that. > > > > Lance, I don't understand what you want to say with, software that I use > to > > index. > > I just use solr. I have a configuration with two entities. One that > selects > > my rss sources from a database and then the main entity that get > > information from an URL and processes it. > > > > Thank you all for the answers. > > Much appreciated > > > > On Saturday, August 24, 2013, Greg Preston wrote: > > > > > But there is an API for sending a delta over the wire, and server side > it > > > does a read, overlay, delete, and insert. And only the fields you sent > > > will be changed. > > > > > > *Might require your unchanged fields to all be stored, though. > > > > > > > > > -Greg > > > > > > > > > On Fri, Aug 23, 2013 at 7:08 PM, Lance Norskog > > > > > > > > > wrote: > > > > > > > Solr does not by default generate unique IDs. It uses what you give > as > > > > your unique field, usually called 'id'. > > > > > > > > What software do you use to index data from your RSS feeds? Maybe > that > > is > > > > creating a new 'id' field? > > > > > > > > There is no partial update, Solr (Lucene) always rewrites the > complete > > > > document. > > > > > > > > > > > > On 08/23/2013 09:03 AM, Greg Preston wrote: > > > > > > > >> Perhaps an atomic update that only changes the fields you want to > > > change? > > > >> > > > >> -Greg > > > >> > > > >> > > > >> On Fri, Aug 23, 2013 at 4:16 AM, Luís Portela Afonso > > > >> > wrote: > > > >> > > > >>> Hi thanks by the answer, but the uniqueId is generated by me. But > > when > > > >>> solr indexes and there is an update in a doc, it deletes the doc > and > > > >>> creates a new one, so it generates a new UUID. 
> > > >>> It is not suitable for me, because i want that solr just updates > some > > > >>> fields, because the UUID is the key that i use to map it to an user > > in > > > my > > > >>> database. > > > >>> > > > >>> Right now i'm using information that comes from the source and > never > > > >>> chages, as my uniqueId, like for example the guid, that exists in > > some > > > rss > > > >>> feeds, or if it doesn't exists i use link. > > > >>> > > > >>> I think there is any simple solution for
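Regarding looking up a document whose uniqueKey is a URL: the usual trick is either the term query parser (which takes the value verbatim) or quoting the value for the lucene parser. A sketch; the field name 'id' and the URL are placeholders:

q={!term f=id}http://example.com/feed/item-1
q=id:"http://example.com/feed/item-1"

The whole q parameter still has to be URL-encoded by the HTTP client; the servlet container decodes it before Solr parses the query, so the value seen by the parser must match the indexed value exactly. From SolrJ, ClientUtils.escapeQueryChars(url) can be used to build the second form safely.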
Solr documents update on index
Hi, I'm having a problem when Solr indexes. It is updating documents that are already indexed. Is this normal behavior? If a document with the same key already exists, is it supposed to be updated? I was thinking it was supposed to update only if the information in the RSS feed had changed. Appreciate your help -- Sent from Gmail Mobile
Re: Data import
So I'm indexing RSS feeds. I'm running the Data Import Handler full-import command with a cron job. It runs every 15 minutes and indexes a lot of RSS feeds from many sources. From the cron job I do an HTTP request using curl to the address http://localhost:port/solr/core/dataimport/?command=full-import&clean=false

When it runs, if the RSS source has a feed that is already indexed in Solr, it updates the existing document. So even if the source has the same information as what is already indexed, the indexed copy gets updated anyway. I want to prevent that. Is that clearer? I can try to provide some examples.

Thanks

On Tuesday, September 10, 2013, Chris Hostetter wrote:
> : When i run "dataimport/?command=full-import&clean=false", solr add new
> : documents with the information. But if the same information already
> : exists with the same uniquekey, it replaces the existing document with a
> : new one.
> : It does not update the document, it creates a new one. It's that possible?
>
> I'm not certain that I'm understanding your question.
>
> It is possible using Atomic Updates, but you have to be explicit about what/how you want Solr to use the new information (ie: when to replace, when to add to a multivalued field, when to increment a numeric field, etc...)
>
> https://wiki.apache.org/solr/Atomic_Updates
>
> I don't think DIH has any straightforward syntax for letting you configure this easily, but as long as you put a "map" in each field (ie: via ScriptTransformer perhaps) containing a single "modifier => value" pair you want applied to that field, it should work.
>
> : I'm indexing rss feeds. I run the rss example that exists in the solr
> : examples, and i does that.
>
> Can you please be more specific about what you would like to see happen, so we can better understand what your actual goal is? It's really not clear if using Atomic Updates is the easiest way to achieve what you're after, or if I'm just completely misunderstanding your question...
>
> https://wiki.apache.org/solr/UsingMailingLists
>
> -Hoss

--
Sent from Gmail Mobile
Re: Data import
But with atomic updates I need to send the information myself, right? I want Solr to index it automatically, and it is doing that. Can you look at the Solr example in the source? There is an example in the example-DIH folder. Imagine that you hit the import URL every 15 minutes. If the same information is already indexed, Solr will update it, and by update I mean delete and index again. I just want Solr to simply discard the information if it is already indexed.

On Tuesday, September 10, 2013, Chris Hostetter wrote:
> : With cron job, I do a http request using curl, to the address
> : http://localhost:port/solr/core/dataimport/?command=full-import&clean=false
> :
> : When it runs, if the rss source has a feed that is already indexed on solr,
> : it updates the existing source.
> : So if the source has the same information of the destiny, it updates the
> : information on the destiny.
> :
> : I want to prevent that. Is that explicit? I may try to provide some
> : examples.
>
> Yes, specific examples would be helpful -- it's not really clear what it is that you want to prevent.
>
> Please note the URL I mentioned before and use it as a guideline for how much detail we need to understand what it is you are asking...
>
> : > Can you please be more specific about what you would like to see happen,
> : > we can better understand what your actual goal is? It's really not clear
>
> : > https://wiki.apache.org/solr/UsingMailingLists
>
> -Hoss

--
Sent from Gmail Mobile
Quick question about indexing with SolrJ.
Is it possible to index plain String JSON documents using SolrJ? I already know that annotated POJOs work fine, but I need a more flexible way to index data without any intermediate POJO. That's because when changing, adding or removing fields I don't want to keep modifying that POJO again and again. -- - Luis Cappa
Re: Quick question about indexing with SolrJ.
Hello, Jack.

I don't want to use POJOs, that's the main problem. I know that you can send HTTP POST requests with JSON data to index new documents, and I would like to do the same with SolrJ, that's all, but I can't find a way to do it, :-/ . What I would like to do is simply take a String with embedded JSON and add() it via an HttpSolrServer instance. Whether the JSON matches the Solr server's schema.xml or not would be a server-side problem, not a client-side one. I mean, I want a best-effort and more flexible way to index data, and using POJOs is not the way to do that: you have to change the Java class, compile it again and relaunch whatever process uses that class.

Regards,

- Luis Cappa

2013/5/13 Jack Krupansky
> Do your POJOs follow a simple flat data model that is 100% compatible with Solr?
>
> If so, maybe you can simply ingest them by setting the Content-type to "application/json" and maybe having to put some minimal wrapper around the raw JSON.
>
> But... if they DON'T follow a simple, flat data model, then YOU are going to have to transform their data into a format that does have a simple, flat data model.
>
> -- Jack Krupansky
>
> -Original Message- From: Luis Cappa Banda
> Sent: Monday, May 13, 2013 10:52 AM
> To: solr-user@lucene.apache.org
> Subject: Quick question about indexing with SolrJ.
>
> Is it possible to index plain String JSON documents using SolrJ? I already know annotating POJOs works fine, but I need a more flexible way to index data without any intermediate POJO.
>
> That's because when changing, adding or removing new fields I don't want to change continously that POJO again and again.
>
> --
> - Luis Cappa
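One flexible way to do this without a POJO (a sketch: it assumes flat JSON objects, uses Jackson only as an example JSON parser, and the Solr URL and field names are placeholders) is to parse the String into a Map and copy the entries into a SolrInputDocument:

import java.util.Map;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonStringIndexer {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void index(HttpSolrServer server, String json) throws Exception {
        // Parse the raw JSON string into a generic map (no POJO involved).
        @SuppressWarnings("unchecked")
        Map<String, Object> fields = MAPPER.readValue(json, Map.class);

        // Copy every JSON property into the document; whether the field names
        // match schema.xml is left for the server to complain about.
        SolrInputDocument doc = new SolrInputDocument();
        for (Map.Entry<String, Object> entry : fields.entrySet()) {
            doc.addField(entry.getKey(), entry.getValue());
        }
        server.add(doc);
    }

    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8080/mysolr");
        index(server, "{\"id\":\"doc-1\",\"title\":\"Hello\",\"mydate\":\"2013-05-13T00:00:00Z\"}");
        server.commit();
        server.shutdown();
    }
}

For lists of documents or nested JSON a bit more mapping logic is needed, or the raw string can simply be POSTed to the /update/json handler with any HTTP client, bypassing SolrJ for the indexing path.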