Re: Indexing performance 7.3 vs 8.7

2020-12-23 Thread Bram Van Dam
On 23/12/2020 16:00, Ron Buchanan wrote: > - both run Java 1.8, but 7.3 is running HotSpot and 8.7 is running > OpenJDK (and a bit newer) If you're using G1GC, you probably want to give Java 11 a go. It's an easy thing to test, and it's had a positive impact for us. Your mileage may va

Re: Prevent Re-indexing if Doc Fields are Same

2020-06-26 Thread Walter Underwood
eld), it does an in-place update without deletion. But the > problem is I don't know if the document is present or I'm indexing it the > first time. > > Is there a way to prevent re-indexing if other fields are the same? > > *P.S. I'm looking for a solution that doesn't require looking up if doc is > present in the Collection or not.*

Prevent Re-indexing if Doc Fields are Same

2020-06-26 Thread Anshuman Singh
all the fields, it deletes the document and re-indexes it. But if I just "set" the "LASTUPDATETIME" field (non-indexed, non-stored, docValues field), it does an in-place update without deletion. But the problem is I don't know whether the document is already present or I'm indexing it
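
For illustration, the "set" form of update mentioned in this message could be built as follows. This is only a sketch: the unique key name "id", the document id "doc-1" and the timestamp value are assumptions, not taken from the thread, and it does not address the open question of how to tell whether the document already exists.

```python
import json

def in_place_update(doc_id, field, value, id_field="id"):
    """Build an update document that only 'set's a single field.

    id_field="id" is an assumed unique-key name; adjust to the real schema.
    """
    return {id_field: doc_id, field: {"set": value}}

# Hypothetical document id and epoch-millis value, for illustration only.
payload = json.dumps([in_place_update("doc-1", "LASTUPDATETIME", 1593129600000)])
print(payload)  # JSON body that would be POSTed to the collection's /update handler
```

Whether Solr actually performs the update in place still depends on the field definition described in the message (non-indexed, non-stored, docValues).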

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Thanks Erick... On Sun, Jun 7, 2020 at 1:50 PM Erick Erickson wrote: > https://lucidworks.com/post/indexing-with-solrj/ > > > > On Jun 7, 2020, at 3:22 PM, Fiz N wrote: > > > > Thanks Jorn and Erick. > > > > Hi Erick, looks like the skeletal SOLRJ program attachment is missing. > > > > Thanks >

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Erick Erickson
https://lucidworks.com/post/indexing-with-solrj/ > On Jun 7, 2020, at 3:22 PM, Fiz N wrote: > > Thanks Jorn and Erick. > > Hi Erick, looks like the skeletal SOLRJ program attachment is missing. > > Thanks > Fiz > > On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson > wrote: > >> Here’s a skele

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Thanks Jorn and Erick. Hi Erick, looks like the skeletal SOLRJ program attachment is missing. Thanks Fiz On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson wrote: > Here’s a skeletal SolrJ program using Tika as another alternative. > > Best, > Erick > > > On Jun 7, 2020, at 2:06 PM, Jörn Franke w

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Erick Erickson
Here’s a skeletal SolrJ program using Tika as another alternative. Best, Erick > On Jun 7, 2020, at 2:06 PM, Jörn Franke wrote: > > You have to write an external application that creates multiple threads, > parses the PDFs and index them in Solr. Ideally you parse the PDFs once and > store th

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Jörn Franke
You have to write an external application that creates multiple threads, parses the PDFs and indexes them in Solr. Ideally you parse the PDFs once and store the resulting text on some file system and then index it. Reason is that if you upgrade to two major versions of Solr you might need to reind
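
A rough sketch of the "parse once, store the text, index later" idea from this message, using a thread pool. The function names and directory layout are hypothetical, and extract_text is only a stand-in for a real PDF parser such as Tika; it does not parse PDFs.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def extract_text(pdf_path):
    # Hypothetical stand-in for a real PDF parser (for example Apache Tika).
    return Path(pdf_path).read_bytes().decode("utf-8", errors="replace")

def parse_once(pdf_path, text_dir):
    """Extract text for one file and cache it, skipping files already parsed."""
    out = Path(text_dir) / (Path(pdf_path).stem + ".txt")
    if not out.exists():
        out.write_text(extract_text(pdf_path), encoding="utf-8")
    return out

def parse_all(pdf_paths, text_dir, workers=4):
    """Parse many files in parallel and return the paths of the stored text files."""
    Path(text_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: parse_once(p, text_dir), pdf_paths))
```

A later re-index can then read the stored .txt files instead of parsing the PDFs again.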

Re: Indexing huge data onto solr

2020-05-26 Thread Erick Erickson
ch one of parent > tuples and execute the child entity sql’s(with where condition of parent) to > create one solr document? Won’t it be more load on database by executing more > sqls? Is there an optimum solution? > > Thanks, > Srinivas > From: Erick Erickson > Sent: 22 May 2

RE: Indexing huge data onto solr

2020-05-25 Thread Srinivas Kashyap
22:52 To: solr-user@lucene.apache.org Subject: Re: Indexing huge data onto solr You have a lot more control over the speed and form of importing data if you just do the initial load in SolrJ. Here’s an example, taking the Tika parts out is easy: https://lucidworks.com/post/indexing-with-solrj

Re: Indexing huge data onto solr

2020-05-22 Thread matthew sporleder
I can index (without nested entities ofc ;) ) 100M records in about 6-8 hours on a pretty low-powered machine using vanilla DIH -> mysql so it is probably worth looking at why it is going slow before writing your own indexer (which we are finally having to do) On Fri, May 22, 2020 at 1:22 PM Erick

Re: Indexing huge data onto solr

2020-05-22 Thread Erick Erickson
You have a lot more control over the speed and form of importing data if you just do the initial load in SolrJ. Here’s an example, taking the Tika parts out is easy: https://lucidworks.com/post/indexing-with-solrj/ It’s especially instructive to comment out just the call to CloudSolrClient.add(d

RE: Indexing Korean

2020-05-04 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Oh wow, I had no idea this existed. Thank you so much! Best, Audrey On 5/1/20, 12:58 PM, "Markus Jelsma" wrote: Hello, Although it is not mentioned in Solr's language analysis page in the manual, Lucene has had support for Korean for quite a while now. https://urldefense.proofp

RE: Indexing Korean

2020-05-01 Thread Markus Jelsma
Hello, Although it is not mentioned in Solr's language analysis page in the manual, Lucene has had support for Korean for quite a while now. https://lucene.apache.org/core/8_5_0/analyzers-nori/index.html Regards, Markus -Original message- > From:Audrey Lorberfeld - audrey.lorberf..

Re: Indexing data from multiple data sources

2020-04-20 Thread Charlie Hull
with this link https://sematext.com/opensee/m/Solr/eHNlswSd1vD6AF?subj=RE+Indexing+data+from+multiple+data+sources As it is open to the world, what we are requesting here is, could you please remove that post as soon as possible before it creates any security issues for us. Your help is very

Re: Indexing data from multiple data sources

2020-04-18 Thread RaviKiran Moola
[cid:6ccc253a-a590-4e89-b5de-fd9a59d88aba] Thanks & Regards, Ravikiran Moola From: RaviKiran Moola Sent: Friday, April 17, 2020 9:13 PM To: solr-user@lucene.apache.org Subject: RE: Indexing data from multiple data sources Hi, Greetings!!! We are w

Re: Indexing data from multiple data sources

2020-04-17 Thread Jörn Franke
What does your Solr.log say? Any error ? > On 17.04.2020 at 20:22, RaviKiran Moola > wrote: > > > Hi, > > Greetings!!! > > We are working on indexing data from multiple data sources (MySQL & MSSQL) in > a single collection. We specified data source details like connection details > along

RE: Indexing data from multiple data sources

2020-04-17 Thread RaviKiran Moola
Hi, Greetings!!! We are working on indexing data from multiple data sources (MySQL & MSSQL) in a single collection. We specified data source details like connection details along with the required fields for both data sources in a single data config file, along with specified required fields d

Re: In-place re-indexing after DocValue schema change

2020-01-29 Thread moscovig
Thank you Emir. I tried this locally (changing the schema, then re-indexing everything in place) and I wasn't able to sort on the docValues fields anymore (someone actually mentioned this before on that forum - https://lucene.472066.n3.nabble.com/DocValues-error-td4240116.html) with the following error "Error from server a

Re: In-place re-indexing after DocValue schema change

2020-01-29 Thread Emir Arnautović
Hi, 1. No, it’s not valid. Solr will look at the schema to see if it can use docValues or if it has to uninvert the field, and it assumes that all fields will have docValues. You can expect anything from wrong results to errors if you do something like that. 2. Not sure if it would work, but it is not better t

In-place re-indexing after DocValue schema change

2020-01-29 Thread moscovig
Hi all, We are about to alter our schema with some DocValue annotations. According to the docs, we should either delete all docs and re-insert them, or create a new collection with the new schema. 1. Is it valid to modify the schema in the current collection, where all documents were created without docV

Re: need for re-indexing when using managed schema

2019-12-16 Thread Erick Erickson
field won’t return any docs indexed before the change until the older docs are re-indexed. So you can see where this is going. “If you add a field _and then reindex all your documents_, it’s perfectly safe. However, between the time you add the field and the re-indexing is complete, your results

need for re-indexing when using managed schema

2019-12-16 Thread Joseph Lorenzini
Hi all, I have a question about the managed schema functionality. According to the docs, "All changes to a collection’s schema require reindexing". This would imply that if you use a managed schema and you use the schema API to update the schema, then doing a full re-index is necessary each time.

Re: Indexing with customized parameters

2019-12-12 Thread Anuj Bhargava
Emir Thanks, Perfect On Thu, 12 Dec 2019 at 13:40, Emir Arnautović wrote: > Hi Anuj, > Maybe I am missing something but this is more question for some SQL group > than for Solr group. I am surprised that you get any records. You can > consult your DB documentation for some more elegant solution

Re: Indexing with customized parameters

2019-12-12 Thread Emir Arnautović
Hi Anuj, Maybe I am missing something but this is more question for some SQL group than for Solr group. I am surprised that you get any records. You can consult your DB documentation for some more elegant solution, but a brute-force solution, if your column is string, could be: WHERE sector = 27

Re: Indexing with customized parameters

2019-12-11 Thread Anuj Bhargava
Any suggestions? Regards, Anuj On Tue, 10 Dec 2019 at 20:52, Anuj Bhargava wrote: > I am trying to index where the *sector field* has the values 27 and/or > 2701 and/or 2702 using the following - > >query="SELECT * FROM country WHERE sector = 27 OR sector = 2701 OR > sector = 2702" > del

Re: Indexing strategies for user profiles

2019-12-10 Thread Dave
I would index the products a user purchased as well as the number of times purchased, then I would take a user, search their bought products boosted by how many times purchased, against other users, have a facet for products and filter out the top bought products that are not on the users alre

Re: Indexing information on number of attachments and their names in EML file

2019-08-14 Thread Zheng Lin Edwin Yeo
Hi Tim, Regarding the returning of the list of Metadata objects, is the code supposed to include the information on the number of attachments in the particular email and/or the name of the attachment? For example, if there are 3 attachments in the email, we should be able to see immediately from th

Re: Indexing information on number of attachments and their names in EML file

2019-08-02 Thread Zheng Lin Edwin Yeo
Thanks for the reply, will find out more about it. Currently I am able to retrieve the normal Metadata of the email, but not the Metadata of the attachments which are part of the contents in the EML file, which looks something like this. --d8b77b057d59ca19-- --d8b77e057d5

Re: Indexing information on number of attachments and their names in EML file

2019-08-02 Thread Tim Allison
I'd strongly recommend rolling your own ingest code. See Erick's superb: https://lucidworks.com/post/indexing-with-solrj/ You can easily get attachments via the RecursiveParserWrapper, e.g. https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParse

Re: Indexing information on number of attachments and their names in EML file

2019-08-02 Thread Jan Høydahl
Try the Apache Tika mailing list. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com > On 2 Aug 2019 at 05:01, Zheng Lin Edwin Yeo wrote: > > Hi, > > Does anyone know if this can be done on the Solr side? > Or does it have to be done on the Tika side? > > Regards, > Edwin >

Re: Indexing information on number of attachments and their names in EML file

2019-08-01 Thread Zheng Lin Edwin Yeo
Hi, Does anyone know if this can be done on the Solr side? Or does it have to be done on the Tika side? Regards, Edwin On Thu, 1 Aug 2019 at 09:38, Zheng Lin Edwin Yeo wrote: > Hi, > > I would like to check: is there any way we can detect the number of > attachments and their names during indexi

Re: indexing slow in solr 8.0.0

2019-07-12 Thread Jan Høydahl
You reduce cpu in half and see slower indexing. That is to be expected. But you fail to tell us any real details about your setup, your docs, how you index, how you measure throughput, what your bottleneck is etc. Also note that you get better throughput when indexing for the first time than if

Re: Indexing nested document: Solr 8.1.1

2019-07-11 Thread sreejith.variyath
Hi, I was using the URL *http://localhost:8983/solr/my-core/update/json/docs*. That was wrong. I should have used *http://localhost:8983/solr/my-core/update*, and it worked. Thanks -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
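
To make the fix concrete, here is a small sketch that targets the plain /update handler with a nested document in Solr's own JSON format. The host and core name come from the message; the document ids and the use of the _childDocuments_ key are assumptions for illustration, since the poster's actual documents are not shown.

```python
import json

# Host and core name as quoted in the message above.
SOLR_CORE = "http://localhost:8983/solr/my-core"

def update_url(core_url=SOLR_CORE):
    """URL of the plain update handler, the one that worked for the poster."""
    return core_url + "/update"

def nested_doc(parent_id, child_ids):
    """Parent document with children under the _childDocuments_ key (hypothetical ids)."""
    return {"id": parent_id, "_childDocuments_": [{"id": cid} for cid in child_ids]}

body = json.dumps([nested_doc("parent-1", ["child-1", "child-2"])])
print(update_url())
print(body)
```

As far as I understand it, /update/json/docs is intended for arbitrary custom JSON that Solr maps to fields itself, which is why Solr-formatted nested documents belong on /update.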

Re: Indexing in one collection affect index in another collection

2019-04-04 Thread Zheng Lin Edwin Yeo
Hi all, This issue is still surfacing in the new Solr 8.0.0. Can't really figure out what the issue is, as it also occurs on systems with more memory. Does anyone have any further insights on this? Regards, Edwin On Fri, 15 Feb 2019 at 18:40, Zheng Lin Edwin Yeo wrote: > Hi Shawn, > > This issue is

Re: Indexing PDF files in SqlBase database

2019-04-03 Thread Arunas Spurga
Yes, I know the reasons for putting this work on a client rather than using Solr directly, and that may be my next task. But first I need to finish my current task: indexing PDF files stored in a SqlBase database. The PDF files are pretty simple, sometimes only a few dozen lines of text. Regards, Aruna On Wed

Re: Indexing PDF files in SqlBase database

2019-04-03 Thread Erick Erickson
For a lot of reasons, I greatly prefer to put this work on a client rather than use Solr directly. Here’s a place to get started, it connects to a DB and also scans local file directory for docs to push through (local) Tika and index. So you should be able to modify it relatively easily to get t

Re: Indexing in one collection affect index in another collection

2019-02-15 Thread Zheng Lin Edwin Yeo
Hi Shawn, This issue is also occurring in the new Solr 7.7.0, with only the same data size of 20 GB. Regards, Edwin On Fri, 8 Feb 2019 at 23:53, Zheng Lin Edwin Yeo wrote: > Hi Shawn, > > Thanks for your reply. > > Although the space in the OS disk cache could be the issue, but we didn't > fac

Re: Indexing in one collection affect index in another collection

2019-02-08 Thread Zheng Lin Edwin Yeo
Hi Shawn, Thanks for your reply. Although the space in the OS disk cache could be the issue, but we didn't face this problem previously, especially in our other setup using Solr 6.5.1, which contains much more data (more than 1 TB), as compared to our current setup in Solr 7.6.0, in which the dat

Re: Indexing in one collection affect index in another collection

2019-02-06 Thread Shawn Heisey
On 2/6/2019 7:58 AM, Zheng Lin Edwin Yeo wrote: Hi everyone, Does anyone has further updates on this issue? It is my strong belief that all the software running on this server OTHER than Solr is competing with Solr for space in the OS disk cache, and that Solr's data is getting pushed out of

Re: Indexing in one collection affect index in another collection

2019-02-06 Thread Zheng Lin Edwin Yeo
ds. >>> >>> If subsequent queries are fast, then to me it does not seem like a >>> problem for a development machine. For production you may wish to store >>> the indices in ram and/or change from windows to linux, if it is important >>> that all qu

Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Zheng Lin Edwin Yeo
>> >> Have a nice day >> Paul >> >> -Original Message- >> From: Shawn Heisey >> Sent: Tuesday, 29 January 2019 13:25 >> To: solr-user@lucene.apache.org >> Subject: Re: Indexing in one collection affect index in another collection

Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Zheng Lin Edwin Yeo
e > indices in ram and/or change from windows to linux, if it is important that > all queries including the first are very fast. > > Have a nice day > Paul > > -Original Message- > From: Shawn Heisey > Sent: Tuesday, 29 January 2019 13:25 > To: solr-us

Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Zheng Lin Edwin Yeo
Hi Shawn, No worries, and thanks for your clarification. We make these changes in order to use the Unified Highlighter, with hl.offsetSource = POSTING, and add "light" term vectors. The settings come from what is written in the Solr guide on highlighting, which says the following: *Postings*: S

Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Shawn Heisey
On 1/29/2019 5:25 AM, Shawn Heisey wrote: Adding termVectors will make the index bigger.  Potentially much bigger. This will increase the overall RAM requirement of the server, especially if the server is handling software other than Solr.  Anything that makes the index bigger can affect perfor

Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Shawn Heisey
On 1/29/2019 5:06 AM, Zheng Lin Edwin Yeo wrote: My guess is after we change our searchFields_tcs schema which is: *From*: *To:* Adding termVectors will make the index bigger. Potentially much bigger. This will increase the overall RAM requirement of the server, especially if the server

Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Zheng Lin Edwin Yeo
Hi Shawn, Thanks for your reply. However, we did not delete our index when the screenshot was taken. All the indexes are still in Solr. My guess is after we change our searchFields_tcs schema which is: *From*: *To:* The above change was done in order to use the Solr recommended unified highl

Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Shawn Heisey
On 1/26/2019 4:48 PM, Zheng Lin Edwin Yeo wrote: Thanks for your reply. Below are the replies to your email: 1) We have tried to set the heap size to be 8g previously when we faced the same issue, and changing to 7g does not help too. 2) We are using standard disk at the moment. 3) In the link

Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Zheng Lin Edwin Yeo
Hi Shawn / Jan, Do we have any further insights about this problem? The same problem still happens even after we make the changes and re-index all the data. Regards, Edwin On Sun, 27 Jan 2019 at 07:48, Zheng Lin Edwin Yeo wrote: > Hi Shawn, > > Thanks for your reply. Below are the replies to y

Re: Indexing in one collection affect index in another collection

2019-01-26 Thread Zheng Lin Edwin Yeo
Hi Shawn, Thanks for your reply. Below are the replies to your email: 1) We have tried to set the heap size to be 8g previously when we faced the same issue, and changing to 7g does not help too. 2) We are using standard disk at the moment. 3) In the link is the screenshot of the process list t

Re: Indexing in one collection affect index in another collection

2019-01-26 Thread Shawn Heisey
On 1/26/2019 9:40 AM, Zheng Lin Edwin Yeo wrote: We have tried to add -a "-XX:+AlwaysPreTouch" that starts Solr, but there is no noticeable difference in the performance. As for the screenshot, I have captured another one after we added -a "-XX:+AlwaysPreTouch", and it is sorted on the Working

Re: Indexing in one collection affect index in another collection

2019-01-26 Thread Zheng Lin Edwin Yeo
Hi Shawn, We have tried to add -a "-XX:+AlwaysPreTouch" that starts Solr, but there is no noticeable difference in the performance. As for the screenshot, I have captured another one after we added -a "-XX:+AlwaysPreTouch", and it is sorted on the Working Set column. Below is the link to the new

Re: Indexing in one collection affect index in another collection

2019-01-25 Thread Shawn Heisey
On 1/25/2019 9:11 AM, Zheng Lin Edwin Yeo wrote: As requested, below is the link to the screenshot of the resource monitor of our system. https://drive.google.com/file/d/1_-Tqhk9YYp9w8injHU4ZPSvdFJOx8A5s/view?usp=sharing The wiki page says to sort on the Working Set column. Your screenshot sh

Re: Indexing in one collection affect index in another collection

2019-01-25 Thread Zheng Lin Edwin Yeo
Hi Jorn, I have set the heap size to 6GB, and the system has 32GB of RAM. The data is indexed from CSV file, so each field's data is like database type of data. Only the searchFields may have more data as it contains the important fields of the collection. But then again it is not as large as thi

Re: Indexing in one collection affect index in another collection

2019-01-25 Thread Zheng Lin Edwin Yeo
Hi Shawn, As requested, below is the link to the screenshot of the resource monitor of our system. https://drive.google.com/file/d/1_-Tqhk9YYp9w8injHU4ZPSvdFJOx8A5s/view?usp=sharing Regards, Edwin On Fri, 25 Jan 2019 at 23:35, Shawn Heisey wrote: > On 1/25/2019 7:47 AM, Zheng Lin Edwin Yeo wro

Re: Indexing in one collection affect index in another collection

2019-01-25 Thread Shawn Heisey
On 1/25/2019 7:47 AM, Zheng Lin Edwin Yeo wrote: Below is the command that we used to start Solr: cd solr-7.5.0 bin\solr.cmd start -cloud -p 8983 -s solrMain\node1 -m 6g -z "localhost:2181,localhost:2182,localhost:2183" -Dsolr.ltr.enabled=true pause Can you gather the screenshot mentioned here

Re: Indexing in one collection affect index in another collection

2019-01-25 Thread Zheng Lin Edwin Yeo
Hi Jan, Below is the command that we used to start Solr: cd solr-7.5.0 bin\solr.cmd start -cloud -p 8983 -s solrMain\node1 -m 6g -z "localhost:2181,localhost:2182,localhost:2183" -Dsolr.ltr.enabled=true pause We also have a replica, and in this development setting, we put it in the same PC to s

Re: Indexing in one collection affect index in another collection

2019-01-25 Thread Jan Høydahl
How do you start Solr, because the solr.in.cmd you sent does not contain the memory settings. What other parameters do you start Solr with? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com > On 25 Jan 2019 at 15:28, Zheng Lin Edwin Yeo wrote: > > Hi Jan, > > We are usin

Re: Indexing in one collection affect index in another collection

2019-01-25 Thread Zheng Lin Edwin Yeo
Hi Jan, We are using 64 bit Java, version 1.8.0_191. We started Solr with 6 GB heap size. Besides Solr, we have ZooKeeper, IIS, Google Chrome and NotePad++ running on the machine. There is still 22 GB of memory left on the server, out of the 32 GB available on the machine. Regards, Edwin On Fr

Re: Indexing in one collection affect index in another collection

2019-01-25 Thread Jan Høydahl
Which java version? 32 or 64 bit? You start Solr with default 512Mb heap size? Other software running on the machine? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com > On 25 Jan 2019 at 13:05, Zheng Lin Edwin Yeo wrote: > > Hi Jan and Shawn, > > For your info, this is

Re: Indexing in one collection affect index in another collection

2019-01-25 Thread Jörn Franke
Have you done a correct sizing with respect to memory / CPU? Check also the data model if you have a lot of queried stored fields that may contain a lot of data. You may also split those two collections on different nodes. > On 23.01.2019 at 18:01, Zheng Lin Edwin Yeo wrote: > > Hi, > > I am using

Re: Indexing in one collection affect index in another collection

2019-01-25 Thread Zheng Lin Edwin Yeo
Hi Jan and Shawn, For your info, this is another debug query. "debug":{ "rawquerystring":"johnny", "querystring":"johnny", "parsedquery":"searchFields_tcs:johnny", "parsedquery_toString":"searchFields_tcs:johnny", "explain":{ "192280":"\n12.8497505 = weight(searc

Re: Indexing in one collection affect index in another collection

2019-01-25 Thread Zheng Lin Edwin Yeo
Hi Jan and Shawn, Please focus on the strange issue that I have described above in more detail; the summary is as follows: 1. Index customers data, then queries from highlight, select, and all handlers are very fast (less than 50ms) 2. Now index policies data, then queries on policies are very fast

Re: Indexing in one collection affect index in another collection

2019-01-25 Thread Zheng Lin Edwin Yeo
Hi Jan, Referring to what you have mentioned that the highlighting takes up most of the time in the first query from the policies collection, the highlighting was very fast (less than 50ms) from the time it was indexed, till the time after customers collection gets indexed, in which it slowed down

Re: Indexing in one collection affect index in another collection

2019-01-24 Thread Zheng Lin Edwin Yeo
Hi Jan, Thanks for your reply. However, we are still getting a slow QTime of 517ms even after we set hl=false&fl=null. Below is the debug query: "debug":{ "rawquerystring":"cherry", "querystring":"cherry", "parsedquery":"searchFields_tcs:cherry", "parsedquery_toString":"search

Re: Indexing in one collection affect index in another collection

2019-01-24 Thread Jan Høydahl
Looks like highlighting takes most of the time on the first query (680ms). You config seems to ask for a lot of highlighting here, like 100 snippets of max 10 characters etc. Sounds to me that this might be a highlighting configuration problem. Try to disable highlighting (hl=false) and see
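
The test suggested here (same query, highlighting switched off, debug output on) can be expressed as a request URL. This is a sketch only: the collection name "policies" and the query term "cherry" are taken from other messages in this thread, while the base URL is an assumption.

```python
from urllib.parse import urlencode

def select_url(base, collection, q, **params):
    """Build a /select URL; extra keyword arguments become query parameters."""
    query = {"q": q, **params}
    return f"{base}/{collection}/select?{urlencode(query)}"

# Assumed base URL; hl=false disables highlighting, debugQuery=true adds timing details.
url = select_url("http://localhost:8983/solr", "policies", "cherry",
                 hl="false", debugQuery="true")
print(url)
```

Comparing the QTime of this request with the original one gives a first hint of how much of the time is spent in the highlighting component.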

Re: Indexing in one collection affect index in another collection

2019-01-24 Thread Zheng Lin Edwin Yeo
Thanks for your reply. Below are what you have requested about our Solr setup, configurations files ,schema and results of debug queries: Looking forward to your advice and support on our problem. 1. System configurations OS: Windows 10 Pro 64 bit System Memory: 32GB CPU: Intel(R) Core(TM) i7-47

Re: Indexing in one collection affect index in another collection

2019-01-24 Thread Jan Høydahl
It would be useful if you can disclose the machine configuration, OS, memory, settings etc, as well as solr config including solr.in .sh, solrconfig.xml etc, so we can see the whole picture of memory, GC, etc. You could also specify debugQuery=true on a slow search and check the

Re: Indexing in one collection affect index in another collection

2019-01-24 Thread Zheng Lin Edwin Yeo
Hi Shawn, Unfortunately your reply of memory may not be valid. Please refer to my explanation below of the strange behaviors (is it much more like a BUG than anything else that is explainable): Note that we still have 18GB of free unused memory on the server. 1. We indexed the first collection c

Re: Indexing in one collection affect index in another collection

2019-01-24 Thread Zheng Lin Edwin Yeo
Hi Shawn, > If the two collections have data on the same server(s), I can see this > happening. More memory is consumed when there is additional data, and > when Solr needs more memory, performance might be affected. The > solution is generally to install more memory in the server. I have found

Re: Indexing in one collection affect index in another collection

2019-01-23 Thread Zheng Lin Edwin Yeo
Hi Shawn, Thanks for your reply. The log only shows a list of the following and I don't see any other logs besides these. 2019-01-24 02:47:57.925 INFO (qtp2131952342-1330) [c:collectioin1 s:shard1 r:core_node4 x:policies_shard1_replica_n2] o.a.s.u.p.StatelessScriptUpdateProcessorFactory update-sc

Re: Indexing in one collection affect index in another collection

2019-01-23 Thread Shawn Heisey
On 1/23/2019 10:01 AM, Zheng Lin Edwin Yeo wrote: I am using Solr 7.5.0, and currently I am facing an issue where, when I am indexing in collection2, the indexing affects the records in collection1. Although the records are still intact, it seems that the settings of the termVectors get wiped out, and

Re: Enquiry about scheduling for re-indexing

2018-11-29 Thread Shawn Heisey
On 11/29/2018 6:00 AM, Alexandre Rafalovitch wrote: Solr does not have a built-in scheduler for triggering indexing. Only for triggering commits and purging auto-expiring records. So, if you want to trigger DIH indexing, you need to use an external scheduling mechanism for that. What Alexandre

Re: Enquiry about scheduling for re-indexing

2018-11-29 Thread Alexandre Rafalovitch
Solr does not have a built-in scheduler for triggering indexing. Only for triggering commits and purging auto-expiring records. So, if you want to trigger DIH indexing, you need to use an external scheduling mechanism for that. Regards, Alex. On Thu, 29 Nov 2018 at 01:03, Ma Man wrote: > > To
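
Since the scheduling has to live outside Solr, one option is a small script that an external scheduler (cron, Windows Task Scheduler, or similar) runs periodically. The sketch below only builds and calls the dataimport URL quoted in the original question; the collection name "mycollection" and the timeout are assumptions.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def dataimport_url(base, collection, command="full-import"):
    """URL that triggers a DIH run, e.g. command=full-import or delta-import."""
    return f"{base}/{collection}/dataimport?{urlencode({'command': command})}"

def trigger_import(base, collection, command="full-import", timeout=30):
    """Call the DIH endpoint and return the HTTP status code."""
    with urlopen(dataimport_url(base, collection, command), timeout=timeout) as resp:
        return resp.status

# An external scheduler would run something like:
#   trigger_import("http://localhost:8983/solr", "mycollection", "delta-import")
print(dataimport_url("http://localhost:8983/solr", "mycollection"))
```

The request returns as soon as the import has been started, so a scheduled job that needs to know the outcome would have to poll the handler's status separately.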

Enquiry about scheduling for re-indexing

2018-11-28 Thread Ma Man
To whom it may concern, Recently, I have been studying whether Apache Solr is able to re-index (Full Import / Delta Import) periodically by configuration instead of triggering by URL ( e.g. http://localhost:8983/solr/{collection_name}/dataimport?command=full-import ) in scheduler tool. Version of the Solr usi

RE: indexing multiple levels of data

2018-11-16 Thread Martin Frank Hansen (MHQ)
. Right now we have -Original Message- From: Jan Høydahl Sent: 16. november 2018 15:29 To: solr-user Subject: Re: indexing multiple levels of data Hi Martin, For a complex use case as this I would recommend you write a separate indexer application that crawls the files, looks up the

Re: indexing multiple levels of data

2018-11-16 Thread Jan Høydahl
Hi Martin, For a complex use case as this I would recommend you write a separate indexer application that crawls the files, looks up the correct metadata XMLs based on given business rules, and then constructs the full Solr document to send to Solr. Even parsing full-text from PDF etc I would r

Re: Indexing vs Search node

2018-11-14 Thread Fernando Otero
Thanks everyone this gave me great arguments for migrating to Solr7 :D On Fri, Nov 9, 2018 at 7:50 PM Shawn Heisey wrote: > On 11/9/2018 1:58 PM, David Hastings wrote: > > I personally like standalone solr for this reason, i can tune the > indexing > > "master" for doing nothing but taking in do

Re: Indexing vs Search node

2018-11-09 Thread Shawn Heisey
On 11/9/2018 1:58 PM, David Hastings wrote: I personally like standalone solr for this reason, i can tune the indexing "master" for doing nothing but taking in documents and that way the slaves dont battle for resources in the process. SolrCloud can be set up pretty similar to this if you're ru

Re: Indexing vs Search node

2018-11-09 Thread David Hastings
I personally like standalone solr for this reason, i can tune the indexing "master" for doing nothing but taking in documents and that way the slaves dont battle for resources in the process. On Fri, Nov 9, 2018 at 3:10 PM Erick Erickson wrote: > Fernando: > > I'd phrase it more strongly than Sh

Re: Indexing vs Search node

2018-11-09 Thread Erick Erickson
Fernando: I'd phrase it more strongly than Shawn. Prior to 7.0 all replicas both indexed and searched (they were NRT replicas), so there wasn't any choice but to index and search on every replica. It's one of those things that if you have very high throughput (indexing) situations, you _might_ want

Re: Indexing vs Search node

2018-11-09 Thread Shawn Heisey
On 11/9/2018 12:13 PM, Fernando Otero wrote: I read in several blog posts that it's never a good idea to index and search on the same node. I wonder how that can be achieved in Solr Cloud or if it happens automatically. I would disagree with that blanket assertion. Indexing does put extra

RE: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Phil Scadden
mmit(solr, "prindex"); return true; -Original Message- From: Erick Erickson Sent: Wednesday, 31 October 2018 06:00 To: solr-user Subject: Re: Indexing PDF file in Apache SOLR via Apache TIKA All of the above work, but for robust production situations you'll wa

Re: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread ☼ R Nair
I have done a production implementation of this, running for the last four months without any issue. Just a restart every week of all components. http://blog.cloudera.com/blog/2015/10/how-to-index-scanned-pdfs-at-scale-using-fewer-than-50-lines-of-code/ Best, Ravion On Tue, Oct 30, 2018, 1:00 PM Er

Re: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Erick Erickson
All of the above work, but for robust production situations you'll want to consider a SolrJ client, see: https://lucidworks.com/2012/02/14/indexing-with-solrj/. That blog combines indexing from a DB and using Tika, but those are independent. Best, Erick On Tue, Oct 30, 2018 at 12:21 AM Kamuela Lau

Re: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Kamuela Lau
Hi there, Here are a couple of ways I'm aware of: 1. Extract-handler / post tool You can use the curl command with the extract handler or bin/post to upload a single document. Reference: https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html 2. DataImportHa

Re: Indexing documents from S3 bucket

2018-10-08 Thread ☼ R Nair
S3 gives listeners. So tap those listeners when objects are added, updated or deleted and use Solr API to push. That's high level, but I believe doable. I worked on Minio, an open source object storage supporting S3 and could do this because Minio gives me good and stable listeners. Best, Ravion

Re: indexing two words, searching single word

2018-08-03 Thread Susheel Kumar
ze="2" would suffice > > -Original Message- > From: Alexandre Rafalovitch > Sent: Friday, 3 August 2018 13:33 > To: solr-user > Subject: Re: indexing two words, searching single word > > But what is your generic problem then. Because you probab

Re: indexing two words, searching single word

2018-08-03 Thread Alexandre Rafalovitch
maxGramSize="15" /> > > > I guess (besides the performance impact) this reduces search results > accuracy? > > -Clemens > > -Original Message- > From: Markus Jelsma > Sent: Friday, 3 August 2018 12:43 > To: solr-user@lucene.apache.org

RE: indexing two words, searching single word

2018-08-03 Thread Markus Jelsma
Hello, If your case is English you could use synonyms to work around the problem of the few compound words of the language. However, should you be dealing with a Germanic compound language, the HyphenationCompoundWordTokenFilter [1] or DictionaryCompoundWordTokenFilter are a better choice. The f

Re: Indexing part of Binary Documents and not the entire contents

2018-07-06 Thread neotorand
Gus, you are never biased. I explored JesterJ a bit. It looks quite promising. I will keep you posted on my experience soon. Regards Neo -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Indexing part of Binary Documents and not the entire contents

2018-07-04 Thread Gus Heck
You might consider using a free tool like JesterJ (www.jesterj.org), which can possibly also automate the acquisition of the documents and transmission to Solr, as well as provide a framework for massaging the contents of the document in between (including Tika processing). (Disclaimer: I'm the prim

Re: Indexing Approach

2018-06-28 Thread Shawn Heisey
On 6/27/2018 8:59 PM, solrnoobie wrote: > One last thing though, will a queue based system work for us? Or are there > better suitable implementations? Exactly how you write your indexing software is up to you. There is no single approach that's the best. Examine your business needs and the thin

Re: Indexing Approach

2018-06-27 Thread solrnoobie
Thank you very much! This helped a lot in our estimates. One last thing though, will a queue based system work for us? Or are there better suitable implementations? -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Indexing part of Binary Documents and not the entire contents

2018-06-27 Thread neotorand
Thanks Erick. I have already gone through the link from the Tika example you shared. Please look at the code in bold. I believe the entire contents are still pushed to memory with the handler object. Sorry, I copied lengthy code from the Tika site. Regards Neo *Streaming the plain text in chunks* Sometimes, you
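
The chunked-streaming idea raised in this message can be illustrated without Tika. The following is a minimal Python sketch, not Tika's API: `iter_text_chunks` and `chunk_size` are hypothetical names introduced here, and the input is assumed to be any already-extracted text stream. It only shows that reading a bounded number of characters at a time keeps one chunk in memory rather than the whole document.

```python
import io
from typing import Iterator, TextIO


def iter_text_chunks(stream: TextIO, chunk_size: int) -> Iterator[str]:
    """Yield successive pieces of at most chunk_size characters from stream.

    Only the current chunk is held in memory by this generator.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty string signals end of stream
            break
        yield chunk


if __name__ == "__main__":
    # Stand-in for extracted document text (assumption for illustration).
    text = io.StringIO("abcdefghij")
    for piece in iter_text_chunks(text, 4):
        print(piece)
```

Each chunk could be handed to an indexing step as it arrives; whether that is appropriate depends on how the parser delivers text, which this sketch does not address.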

Re: Indexing Approach

2018-06-26 Thread Shawn Heisey
On 6/26/2018 8:24 AM, solrnoobie wrote: > - Each SP call will return 15 result sets. > - Each document can contain 300-1000 child documents. > - If the batch size is 1000, the child documents for each can contain > 300-1000 documents so that will eat up the 4g's allocated to the > application. If
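
The arithmetic in the quoted message works out to roughly 1000 × 300 = 300,000 up to 1000 × 1000 = 1,000,000 documents held per batch once children are counted. One possible way to bound that, sketched in Python under stated assumptions: documents are plain dicts, children sit in a `_childDocuments_` list, and `doc_weight`, `batch_by_total_docs`, and `max_docs` are hypothetical names introduced for illustration, not part of Solr or SolrJ. The sketch caps each batch by total document count (parents plus children) instead of by parent count.

```python
from typing import Dict, Iterable, Iterator, List


def doc_weight(doc: Dict) -> int:
    """Count a parent document plus its direct child documents."""
    return 1 + len(doc.get("_childDocuments_", []))


def batch_by_total_docs(docs: Iterable[Dict], max_docs: int) -> Iterator[List[Dict]]:
    """Yield batches whose combined parent+child count stays within max_docs.

    A single parent heavier than max_docs is still emitted, alone,
    so that no document is silently dropped.
    """
    batch: List[Dict] = []
    total = 0
    for doc in docs:
        weight = doc_weight(doc)
        # Close the current batch before it would exceed the cap.
        if batch and total + weight > max_docs:
            yield batch
            batch, total = [], 0
        batch.append(doc)
        total += weight
    if batch:
        yield batch


if __name__ == "__main__":
    # Ten parents with 300 children each (made-up data for illustration).
    parents = [
        {"id": str(i), "_childDocuments_": [{"id": f"{i}-{j}"} for j in range(300)]}
        for i in range(10)
    ]
    for b in batch_by_total_docs(parents, max_docs=1000):
        print(len(b), sum(doc_weight(d) for d in b))
```

Because the function consumes any iterable, it can be fed from a generator that reads result sets lazily, so the full parent list never has to exist in memory at once.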

Re: Indexing Approach

2018-06-26 Thread solrnoobie
Thanks for the tip. Although we have increased our application's heap to 4g, it is still not enough. I guess here are the things we think we did wrong: - Each SP call will return 15 result sets. - Each document can contain 300-1000 child documents. - If the batch size is 1000, the child docum

Re: Indexing part of Binary Documents and not the entire contents

2018-06-26 Thread Erick Erickson
Well, if you were using ERH you'd have the same problem as it uses Tika. At least if you run Tika on some client somewhere, if you do have a document that blows out memory or has some other problem, your client can crash without taking Solr with it. That's one of the reasons, in fact, that we don'

Re: Indexing part of Binary Documents and not the entire contents

2018-06-26 Thread Shawn Heisey
On 6/26/2018 7:13 AM, neotorand wrote: Don't you think the below method is very expensive? autoParser.parse(input, textHandler, metadata, context); If the document size is bigger, then it will need enough memory to hold the document (i.e. ContentHandler). Any other alternative? I did find this: h
