The other thing I would be curious about: in your reindexing process, do you clear out the entire index beforehand? If so, perhaps there is content that is missing or was moved.
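To make the comparison from step 10.2 of Erick's reply (quoted below) less manual, here is a minimal Python sketch. It assumes both index directories are readable from the machine running the script (for example via a mount), and the paths in the usage comment are purely hypothetical. It only compares file names and sizes, as the long listings would, and it also lists the "segments_" files so you can see whether more than one is present on either side.

```python
from pathlib import Path


def list_sizes(index_dir):
    """Map file name -> size in bytes for the regular files directly in index_dir."""
    return {p.name: p.stat().st_size for p in Path(index_dir).iterdir() if p.is_file()}


def compare_index_dirs(source_dir, target_dir):
    """Return (only_in_source, only_in_target, size_mismatches) as sorted name lists."""
    src = list_sizes(source_dir)
    tgt = list_sizes(target_dir)
    only_in_source = sorted(src.keys() - tgt.keys())
    only_in_target = sorted(tgt.keys() - src.keys())
    size_mismatches = sorted(n for n in src.keys() & tgt.keys() if src[n] != tgt[n])
    return only_in_source, only_in_target, size_mismatches


def segments_files(index_dir):
    """Return the names of all 'segments_*' files found in index_dir, sorted."""
    return sorted(n for n in list_sizes(index_dir) if n.startswith("segments_"))


# Hypothetical usage (these paths are assumptions, not taken from the thread):
# missing, extra, differing = compare_index_dirs("/mnt/prod/index", "/opt/solr/data/index")
# print(missing, extra, differing, segments_files("/opt/solr/data/index"))
```

If all three lists come back empty and each side has a single "segments_" file, the directories match at the level the long listings would show. Identical sizes do not prove identical content, so a checksum comparison would be the stricter follow-up.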
On Thu, Feb 14, 2019 at 11:07 AM Erick Erickson <erickerick...@gmail.com> wrote:

> Basically, this is not possible ;). Therefore there's something I
> don't understand....
>
> There's nothing anywhere except what's in the index. By that I mean that
> _if_ you copy an index (the data directory and children) from one place
> to another, that's all there is. No information about what's in the index
> is stored anywhere else. So there are a couple of possibilities I see:
>
> 1> Your rsync isn't doing what you think. By that I mean that "somehow"
> it isn't copying segments (perhaps with the same name, although the size
> and time checks would make it extremely unlikely to skip one). What
> happens if you _delete_ the data index on your target system first?
>
> 2> I'm not entirely sure what happens if there are multiple "segments_n"
> files in the index. That file "points" to all the current segments. From
> a strictly theoretical standpoint, my _guess_ is that Lucene chooses the
> one with the highest "_n" value. So if you have multiple ones of those,
> it would be interesting to know.
>
> 3> Has Solr been restarted (or at least the core reloaded) on the target?
>
> So here's the experiment I'd run:
> 1> shut down the Solr running on the target.
> 2> delete the data dir.
> 3> restart Solr and verify that you have zero docs. This will recreate
> the data dir and verify that that Solr instance is pointing where you
> think it is as a sanity check.
> 4> stop Solr again on the target.
> 5> do a hard commit on the source.
> 6> get a long listing "ls -l" on your source index. This should be a
> lot of files like _0.tim, _0.fdt...., _1.tim, _1.fdt.... etc.
> 7> do your rsync. You should _not_ be indexing to the source at this time.
> 8> start Solr on the target.
> 9> check the target again. Assuming that you have _not_ been adding
> any documents to the source system during the rsync, I'd be stunned if
> there were any differences.
> 10> If there are incorrect counts or other anomalies:
> 10.1> double-check your rsync. Is it really getting the files from your
> source?
> 10.2> compare the long listing from your index you took in <6> with
> the target. Are all files identical size-wise? Are there any files on
> the target that are not on the source and vice-versa? If there are
> differences, that would explain your issues and would point to your
> rsync process being messed up.
>
> If the index directories are identical on the source and target and
> you _still_ see differences then there's an alternate reality that we
> occupy ;).
>
> And the Alfresco folks would probably be the ones to contact.
>
> Best,
> Erick
>
> On Wed, Feb 13, 2019 at 11:28 PM Mathieu Menard
> <mathieu.men...@realdolmen.com> wrote:
> >
> > Hello Andrea,
> >
> > I'm really sorry for the delay in answering, but I needed more
> > information before replying.
> >
> > Yes, 5.365.213 is the numDocs we got just after the sync, and yes,
> > 4.537.651 is the numDocs we got on the staging server after the
> > reindexing. The colleague who performed the rsync confirms that it
> > completed entirely.
> >
> > I don't see any uncompleted transactions, which normally means that
> > the indexing is complete. That's why I don't understand the difference.
> >
> > Kind Regards
> >
> > Matthieu
> >
> > -----Original Message-----
> > From: Andrea Gazzarini [mailto:a.gazzar...@sease.io]
> > Sent: Saturday, 9 February 2019 16:56
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr Index Size after reindex
> >
> > Yes, those numbers are different and that should explain the different
> > size. I think you should be able to find some information in the
> > Alfresco or Solr log. There must be a reason for the missing content.
> >
> > For example, are those numbers coming from two comparable snapshots?
> > In other words, I imagine that at a given moment X you rsync-ed the
> > two servers:
> >
> > * 5.365.213 is the numDocs you got just after the sync, isn't it?
> > * 4.537.651 is the numDocs you got in the staging server after the
> >   reindexing, isn't it? Are you sure the whole reindexing is completed?
> >
> > MaxDocs is the number of documents you have in the index including the
> > deleted docs not yet cleared by a merge. In the console you should also
> > see the "Deleted docs" count, which should be equal to
> > (maxdocs - numdocs).
> >
> > Ciao
> >
> > Andrea
> >
> > On 08/02/2019 15:53, Mathieu Menard wrote:
> > >
> > > Hi Andrea,
> > >
> > > I've checked this information and here is the result:
> > >
> > >            PRODUCTION   STAGING
> > > numDocs    5.365.213    4.537.651
> > > MaxDoc     5.845.469    5.129.556
> > >
> > > It seems that there are more than 800.000 additional docs in
> > > PRODUCTION, which would explain the larger index size. But there is
> > > one thing that I don't understand: we have copied the DB and the
> > > contentstore, so the numDocs for the two environments should be the
> > > same, no?
> > >
> > > Could you also explain the meaning of the maxDocs value, please?
> > >
> > > Thanks
> > >
> > > Matthieu
> > >
> > > From: Andrea Gazzarini [mailto:a.gazzar...@sease.io]
> > > Sent: Friday, 8 February 2019 14:54
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Solr Index Size after reindex
> > >
> > > Hi Mathieu,
> > > what about the docs in the two infrastructures? Do they have the same
> > > numbers (numdocs / maxdocs)? Any meaningful message (error or not) in
> > > log files?
> > >
> > > Andrea
> > >
> > > On 08/02/2019 14:19, Mathieu Menard wrote:
> > >
> > > Hello,
> > >
> > > I would like to have your point of view about an observation we
> > > have made on our two Alfresco installations (Production and Staging
> > > environments), and more specifically on the size of our Solr indexes
> > > on these two environments.
> > >
> > > Regularly we do an rsync between the Production and the Staging
> > > environment: we make a copy of Alfresco's DB and a copy of the
> > > entire contentstore, and after that we reindex all the Alfresco
> > > content.
> > >
> > > We have noticed that for the production environment we have 19 Gb
> > > of indexes while in the staging we have "only" 11. Gb of indexes.
> > > We have some difficulty understanding this difference because we
> > > assume that index optimization is the same for a full reindex as
> > > for normal use of Solr.
> > >
> > > I've verified the configuration between the two Solr instances and
> > > I don't see any differences. Could you help me better understand
> > > this phenomenon?
> > >
> > > Here is some information about our two environments; if you need
> > > more details, I will provide them as soon as possible:
> > >
> > >                    PRODUCTION                 STAGING
> > > Alfresco version   5.1.1.4                    5.1.1.4
> > > Solr version
> > > Java version
> > > Linux machine      See                        See
> > >                    Staging_caracteristics.txt Staging_caracteristics.txt
> > >                    file in attachment         file in attachment
> > >
> > > Please let me know if you need any other information; I will send
> > > it to you promptly.
> > >
> > > Kind Regards
> > >
> > > Matthieu
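
For reference, the relationship Andrea describes in the thread (deleted docs = maxDoc - numDocs) can be applied to the figures Mathieu posted. This is only a small arithmetic sketch in Python; the dictionary layout and function name are my own, and the numbers are copied from the table quoted above.

```python
# Figures quoted in the thread (written there with "." as thousands separator).
stats = {
    "PRODUCTION": {"numDocs": 5_365_213, "maxDoc": 5_845_469},
    "STAGING": {"numDocs": 4_537_651, "maxDoc": 5_129_556},
}


def deleted_docs(env_stats):
    """Deleted-but-not-yet-merged docs, i.e. maxDoc - numDocs."""
    return env_stats["maxDoc"] - env_stats["numDocs"]


deleted = {env: deleted_docs(s) for env, s in stats.items()}
num_docs_gap = stats["PRODUCTION"]["numDocs"] - stats["STAGING"]["numDocs"]

for env, count in deleted.items():
    print(f"{env}: {count} deleted docs")
print(f"numDocs gap (PRODUCTION - STAGING): {num_docs_gap}")
```

With these inputs, production carries 480.256 deleted docs and staging 591.905, and the numDocs gap is 827.562 live documents. That suggests deleted documents alone do not account for the difference between the two environments.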