Re: index size, stored vs indexed

2018-11-14 Thread Erick Erickson
Can't really be answered. For instance, stored data is held in *.fdt files and is largely irrelevant to searching since that data is only consulted for returning stored fields of the top N docs. So if your index consists of 90% stored data it's one answer, if 10% it's totally another. the stored da
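A rough way to see which of those cases applies is to total an index directory by file extension. The sketch below is only an illustration and not something posted in the thread: the function names are made up here, it assumes the segments are not in compound (.cfs) format, and it counts .fdx (the stored-fields index file) alongside the .fdt files mentioned above.

```python
from collections import defaultdict
from pathlib import Path


def size_by_extension(index_dir):
    """Return {extension: total_bytes} for the files directly inside index_dir."""
    totals = defaultdict(int)
    for path in Path(index_dir).iterdir():
        if path.is_file():
            totals[path.suffix] += path.stat().st_size
    return dict(totals)


def stored_fraction(totals, stored_exts=(".fdt", ".fdx")):
    """Fraction of all bytes that sit in stored-field files (assumed extensions)."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return sum(totals.get(ext, 0) for ext in stored_exts) / total
```

A result near 0.9 would point at the "90% stored data" case, a result near 0.1 at the other one.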

Re: Index size issue in SOLR-6.5.1

2018-10-08 Thread Dominique Bejean
Hi, In the Solr Admin console, you can open the "Segment info" page for each core. There you can see whether the segments on server X contain more deleted documents. Dominique On Mon, Oct 8, 2018 at 07:29, SOLR4189 wrote: > About which details do you ask? Yesterday we restarted all our solr > ser

Re: Index size issue in SOLR-6.5.1

2018-10-07 Thread SOLR4189
About which details do you ask? Yesterday we restarted all our solr services and index size in serverX decreased from 82Gb to 60Gb, and in serverY index size didn't change (49Gb). -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Index size issue in SOLR-6.5.1

2018-10-07 Thread Dominique Bejean
Hi, What do the per-core segment details in the admin UI show? Are there more deleted documents? Regards Dominique On Sun, Oct 7, 2018 at 08:22, SOLR4189 wrote: > Hi all, > > We use SOLR-6.5.1 and we have very strange issue. In our collection index > size is very different from server to server (33gb

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-19 Thread Alessandro Benedetti
Hi David, good to know that sorting solved your problem. I understand perfectly that given the urgency of your situation, having the solution ready takes priority over continuing with the investigations. I would recommend anyway to open a Jira issue in Apache Solr with all the information gathered

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-18 Thread Howe, David
Hi Erick & Alessandro, I have solved my problem by re-ordering the data in the SQL query. I don't know why it works but it does. I can consistently re-produce the problem without changing anything else except the database table. As our Solr build is scripted and we always build a new Solr s

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Erick Erickson
I didn't mean to imply that _you'd_ changed things, the _defaults_ may have changed. So the "string" fieldType may be defined with docValues="true" in your new schema and "false" in your old schema without you intentionally changing anything at _all_. That's why the LukeRequestHandler will hel

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Howe, David
Hi Erick, I'm 99% sure that I haven't changed the field types between the two snapshots as all of my test runs are completely scripted and build a new Solr server from scratch (both the virtual machine and the Solr software). I can diff the scripts between two runs to make sure I haven't acci

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Erick Erickson
Well, I'm not entirely sure either ;) What I'm seeing. And, BTW, I'm making a couple of assumptions here. In the one listing, your biggest segment starts with _7l and in the other its _zd. The aggregate size is 2,815M for _7l and 705M for _zd. So multiplying the individual files in _zd by 4 (p
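The comparison described here (take the biggest segment from each listing, scale one to the other's total, then compare extension by extension) can be sketched roughly as follows. This is an illustration under stated assumptions, not code from the thread: the helper names are hypothetical, the input is a list of (filename, size) pairs typed in from a directory listing, and the segment name is guessed as the part between the leading underscore and the next underscore or dot.

```python
from collections import defaultdict


def segment_of(filename):
    """Best-effort segment name: '_7l.fdt' and '_7l_Lucene50_0.tim' both give '_7l'."""
    if not filename.startswith("_"):
        return None  # e.g. segments_N or write.lock
    stem = filename.split(".", 1)[0]
    return "_" + stem[1:].split("_", 1)[0]


def sizes_by_segment(listing):
    """Group (filename, size) pairs into {segment: {extension: bytes}}."""
    out = defaultdict(lambda: defaultdict(int))
    for name, size in listing:
        seg = segment_of(name)
        if seg is not None and "." in name:
            out[seg][name.rsplit(".", 1)[-1]] += size
    return {seg: dict(exts) for seg, exts in out.items()}


def compare_segments(a, b):
    """Return {ext: (bytes_in_a, bytes_in_b_scaled)} with b scaled to a's total."""
    total_b = sum(b.values())
    if total_b == 0:
        raise ValueError("segment b has no data to scale")
    scale = sum(a.values()) / total_b
    exts = sorted(set(a) | set(b))
    return {ext: (a.get(ext, 0), round(b.get(ext, 0) * scale)) for ext in exts}
```

Extensions whose scaled sizes differ sharply between the two segments are the ones worth looking at first.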

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Howe, David
Hi Erick, Thinking some more about the differences between the two sort orders has suggested another possibility. We also have a geo spatial field defined in the index: echo "$(date) Creating geoLocation field" curl -X POST -H 'Content-type:application/json' --data-binary '{ "add-fiel

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Howe, David
Hi Erick, Below is the file listing for when the index is loaded with the table ordered in a way that produces the smaller index. I have checked the console, and we have no deleted docs and we have the same number of docs in the index as there are rows in the staging table that we load from.

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Howe, David
Hi Alessandro, There are 14,061,990 records in the staging table and that is how many documents that we end up with in Solr. I would be surprised if we have a problem with the id, as we use the primary key of the table as the id in Solr so it must be unique. The primary key of the staging ta

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Alessandro Benedetti
It's a silly thing, but to confirm the direction that Erick is suggesting: How many rows are in the DB? If updates are happening on Solr (causing the deletes), I would expect a greater number of documents in the DB than in the Solr index. Is the DB primary key (if any) the same as the uniqueKey fie

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Howe, David
Hi Emir, We have no copy field definitions. To keep things simple, we have a one to one mapping between the columns in our staging table and the fields in our Solr index. Regards, David David Howe Java Domain Architect Postal Systems Level 16, 111 Bourke Street Melbourne VIC 3000 T 039106

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Emir Arnautović
Hi David, I skimmed through the thread and don’t see whether this was already eliminated, so I will ask: Can you check if there are some copyField rules that are triggered when the new field is added? You mentioned that ordering fixed the size of the index, but it might be worth checking. Emir -- Monitoring - Log Managem

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Erick Erickson
This isn't terribly useful without a similar dump of "the other" index directory. The point is to compare the different extensions in some segment where the sum of all the files in that segment is roughly equal. So if you have a listing of the old index around, that would help. bq: We don't have any

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Howe, David
Hi Erick, I have the full dump of the Solr index file sizes as well if that is of any help. I have attached it below this message. We don't have any deleted docs in our index, as we always build it from a brand new virtual machine with a brand new installation of Solr. The ordering is defini

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Erick Erickson
David: Rats, the cfs files make everything I'd hoped to understand with the sizes ambiguous, since they conceal the underlying sizes of each other extension. We can approach it a bit differently though. Take one segment that's _not_ in cfs format where the total size of all files making up that se

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Pratik Patel
@Alessandro I will see if I can reproduce the same issue just by turning off omitNorms on field type. I'll open another mail thread if required. Thanks. On Thu, Feb 15, 2018 at 6:12 AM, Howe, David wrote: > > Hi Alessandro, > > Some interesting testing today that seems to have gotten me closer t

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Howe, David
Hi Alessandro, Some interesting testing today that seems to have gotten me closer to what the issue is. When I run the version of the index that is working correctly against my database table that has the extra field in it, the index suddenly increases in size. This is even though the data i

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Alessandro Benedetti
@Pratik: you should have investigated. I understand that solved your issue, but in case you needed norms, it doesn't make sense that they would cause your index to grow by a factor of 30. You must have faced a nasty bug if it was just the norms. @Howe : *Compound File* .cfs, .cfe An optional "virtua

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Howe, David
I have set docValues=false on all of the string fields in our index that have indexed=false and stored=true. This gave a small improvement in the index size from 13.3GB to 12.82GB. I have also tried

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Pratik Patel
You are right, in my case this field type was applied to many text fields. These include many copy fields and dynamic fields as well. In my case, only specifying omitNorms=true for field type "text_general" fixed the issue. I didn't do anything else or have any other bug. On Wed, Feb 14, 2018 at 1

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Alessandro Benedetti
Hi Pratik, how is it possible that just the norms for a single field were causing such a massive index size increase in your case? In your case I think it was for a field type used by multiple fields, but it's still suspicious in my opinion: norms shouldn't be that big. If I remember correctly in

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Erick Erickson

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Pratik Patel

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Howe, David

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Howe, David
Thanks Hoss. I will try setting docValues to false, as we only ever want to be able to retrieve the value of this field. Regards, David

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Howe, David
Hi Erick, Thanks for responding. You are correct that we don't have any deleted docs. When we want to re-index (once a fortnight), we build a brand new installation of Solr from scratch and re-import the new data into an empty index. I will try setting docValues to false and see if that make

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Howe, David
Hi Alessandro, The docker image is like a disk image of the entire server, so it includes the operating system, the Solr installation and the data. Because we run in the cloud and our index isn't that big, this is an easy and fast way for us to scale our Solr cluster without having to configu

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread David Hastings
To piggy back on this, what would be the right scenarios to use docvalues='true'? On Tue, Feb 13, 2018 at 1:10 PM, Chris Hostetter wrote: > > : We are using Solr 7.1.0 to index a database of addresses. We have found > : that our index size increases massively when we add one extra field to > :

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Chris Hostetter
: We are using Solr 7.1.0 to index a database of addresses. We have found : that our index size increases massively when we add one extra field to : the index, even though that field is stored and not indexed, and doesn’t what about docValues? : When we run an index load without the problema

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Erick Erickson
David: Right, Optimize Is Evil. Well, actually in your case it's not. In your specific case you can optimize every time you build your index and be OK, gory details here: https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/ But that's just for background. The key

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Alessandro Benedetti
Hi David, given the fact that you are actually building a new index from scratch, my shot in the dark didn't hit any target. When you say : "Once the import finishes we save the docker image in the AWS docker repository. We then build our cluster using that image as the base" Do you mean just c

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Howe, David
Hi Alessandro, Thanks for responding. We rebuild the index every time starting from a fresh installation of Solr. Because we are running at AWS, we have automated our deployment so we start with the base docker image, configure Solr and then import our data every time the data changes (it onl

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Alessandro Benedetti
I assume you re-index in full, right? My shot in the dark is that this increase is temporary. You re-index, so you effectively delete and add all documents (this means that even if the new field is just stored, you re-build the entire index for all the fields). Create new segments and the old docs ar

Re: Index size optimization between 4.5.1 and 4.10.4 Solr

2017-12-07 Thread Natarajan, Rajeswari
Thanks a lot for the response. We did not change schema or config. We simply opened 4.5 indexes with 4.10 libraries. Thank you, Rajeswari On 12/7/17, 3:17 PM, "Shawn Heisey" wrote: On 12/7/2017 1:27 PM, Natarajan, Rajeswari wrote: > We have upgraded solr from 4.5.1 to 4.10.4 and we see

Re: Index size optimization between 4.5.1 and 4.10.4 Solr

2017-12-07 Thread Shawn Heisey
On 12/7/2017 1:27 PM, Natarajan, Rajeswari wrote: > We have upgraded solr from 4.5.1 to 4.10.4 and we see index size reduction. > Trying to see if any optimization done to decrease the index sizes , couldn’t > locate. If anyone knows why please share. Here's a history where you can see the a s

Re: index size increses dramatically

2016-08-17 Thread Jan Høydahl
Hi It is quite normal that index size can be close to double during background merge of segments. If you have a lot of deletions and/or reindexed docs then the same document may also exist in multiple segments, taking up space temporarily until a merge or optimize. If this slows down your syst

Re: Index size increase after upgrade to 4.9?

2014-07-30 Thread Shawn Heisey
On 7/30/2014 10:00 AM, Shawn Heisey wrote: > It may turn out that this is actually a bug in merging, where old > segments are not getting deleted. I noticed in the optimized index that > there is a single large segment of about 20GB and a bunch of other > segments that are all older than the singl

Re: Index size increase after upgrade to 4.9?

2014-07-30 Thread Shawn Heisey
On 7/30/2014 9:16 AM, Shawn Heisey wrote: > On 7/30/2014 9:10 AM, Erick Erickson wrote: >> I assume you've optimized? Or otherwise insured that there aren't >> any deleted docs > It's all straight indexing with DIH from MySQL, so there really are no > deleted docs, but about an hour after the r

Re: Index size increase after upgrade to 4.9?

2014-07-30 Thread Shawn Heisey
On 7/30/2014 9:10 AM, Erick Erickson wrote: > I assume you've optimized? Or otherwise insured that there aren't > any deleted docs It's all straight indexing with DIH from MySQL, so there really are no deleted docs, but about an hour after the rebuild finished, one of the shards did get optimi

Re: Index size increase after upgrade to 4.9?

2014-07-30 Thread Erick Erickson
I assume you've optimized? Or otherwise insured that there aren't any deleted docs Best, Erick On Wed, Jul 30, 2014 at 6:27 AM, Shawn Heisey wrote: > Yesterday I upgraded my dev server to Solr 4.9, and also upgraded a > third-party plugin to a new version that's compatible with Solr 4.9. >

Re: Index size - to determine storage

2014-01-14 Thread Sumit Arora
Hi Amit, This Excel sheet will help you estimate the index size. size-estimator-lucene-solr.xls - Sumit Arora -- View this message in context: http://lucene.472066.n3.nabble.com/Index-size-to-determine

Re: Index size - to determine storage

2014-01-09 Thread Alexandre Rafalovitch
Try running the PDF through standalone Tika and see what comes back. That's the size of the input. It will usually be quite a small proportion of the PDF size, possibly down to metadata only and no text if your PDF does not include a text layer. Then, it depends on your storing and indexing options, your tokeni

Re: Index size - to determine storage

2014-01-09 Thread Michael Della Bitta
Hi Amit, It really boils down to how much of that 100kb is actually text, and how you analyze and store the text. Meaning, it's really hard for us to say. You're probably going to need to experiment to figure out what the storage needs for your use case are. Michael Della Bitta Applications Deve

Re: index size with replication

2012-03-15 Thread Walter Underwood
No, the deleted files do not get replicated. Instead, the slaves do the same thing as the master, holding on to the deleted files after the new files are copied over. The optimize is obsoleting all of your index files, so maybe you should quit doing that. Without an optimize, the deleted files will

Re: index size with replication

2012-03-15 Thread Mike Austin
The problem is that when replicating, the double-size index gets replicated to slaves. I am now doing a dummy commit with always the same document and it works fine.. After the optimize and dummy commit process I just end up with numDocs = x and maxDocs = x+1. I don't get the nice green checkmark
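For reference, the gap between maxDocs and numDocs is the count of deleted documents still sitting in the segments, which is why the dummy commit leaves numDocs = x and maxDocs = x+1. A minimal sketch of that arithmetic, with hypothetical helper names and the two numbers read off the admin page by hand:

```python
def deleted_docs(num_docs, max_docs):
    """Documents marked deleted but not yet merged away: maxDocs - numDocs."""
    if num_docs < 0 or max_docs < num_docs:
        raise ValueError("expected 0 <= numDocs <= maxDocs")
    return max_docs - num_docs


def deleted_ratio(num_docs, max_docs):
    """Share of the index occupied by deleted documents, by document count."""
    if max_docs == 0:
        return 0.0
    return deleted_docs(num_docs, max_docs) / max_docs
```

With the figures above, one deleted document out of x+1 is negligible for any realistic x, so the index is effectively optimized even without the checkmark.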

Re: index size with replication

2012-03-15 Thread Erick Erickson
Or just ignore it if you have the disk space. The files will be cleaned up eventually. I believe they'll magically disappear if you simply bounce the server (but I work on *nix so I can't personally guarantee it). And replication won't replicate the stale files, so that's not a problem either. Best

Re: index size with replication

2012-03-14 Thread Mike Austin
Shawn, Thanks for the detailed answer! I will play around with this information in hand. Maybe a second optimize or just a dummy commit after the optimize will help get me past this. Both not the best options, but maybe it's a do it because it's running on windows work-around. If it is indeed a

Re: index size with replication

2012-03-14 Thread Shawn Heisey
On 3/14/2012 2:54 PM, Mike Austin wrote: The odd thing is that if I optimize the index it doubles in size.. If I then, add one more document to the index it goes back down to half size? Is there a way to force this without needing to wait until another document is added? Or do you have more info

Re: index size with replication

2012-03-14 Thread Mike Austin

RE: index size with replication

2012-03-14 Thread Dyer, James
", then optimize is (probably) not going to do anything helpful. James Dyer

Re: index size with replication

2012-03-14 Thread Ahmet Arslan
> Another note.. if I reload solr app > it goes back down in size. > > here is my replication settings on the master: > > class="solr.ReplicationHandler" > >         >           name="replicateAfter">startup >           name="replicateAfter">commit >           name="replicateAfter">optimize >  

Re: index size with replication

2012-03-14 Thread Mike Austin
Another note: if I reload the Solr app it goes back down in size. Here are my replication settings on the master: startup commit optimize 1 schema.xml,stopwords.txt,elevate.xml 00:00:30 On Wed, Mar 14, 2012 at 3:54 PM, Mike Aust

Re: index size with replication

2012-03-14 Thread Mike Austin
The odd thing is that if I optimize the index it doubles in size.. If I then, add one more document to the index it goes back down to half size? Is there a way to force this without needing to wait until another document is added? Or do you have more information on what you think is going on? I'm

Re: index size with replication

2012-03-13 Thread Li Li
Optimize will generate new segments and delete old ones. If your master also serves searches during indexing, the old files may still be held open by an old SolrIndexSearcher; they will be deleted later. So while indexing, the index size may double, but a moment later the old index files will be deleted.

Re: Index size

2010-02-26 Thread Jean-Sebastien Vachon
Hi, Each document can be up to 10K. Most of it comes from a single field which is both indexed and stored. The data is uncompressed because compression would eat up too much CPU considering the volume we have. We have around 30 fields in all. We also need to compute some facets as well as collapse the

Re: Index size

2010-02-25 Thread Otis Gospodnetic
It depends on many factors - how big those docs are (compare a tweet to a news article to a book chapter) whether you store the data or just index it, whether you compress it, how and how much you analyze the data, etc. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop e

Re: index size before and after commit

2009-10-01 Thread Lance Norskog
Ha! Searching "partial optimize" on http://www.lucidimagination.com/search , we discover SOLR-603 which gives the 'maxSegments' option to the command. The text does not include the word 'partial'. It's on http://wiki.apache.org/solr/UpdateXmlMessages. The command gives a number of Lucene segments
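As a rough illustration of what such a command looks like, the sketch below builds the XML update message with the 'maxSegments' attribute on the optimize element. It assumes the message format documented on the UpdateXmlMessages wiki page; the function name is made up for this note, and actually posting the message to a Solr update handler is left out.

```python
import xml.etree.ElementTree as ET


def optimize_message(max_segments=None):
    """Build an <optimize/> update message, optionally limited to max_segments."""
    elem = ET.Element("optimize")
    if max_segments is not None:
        if max_segments < 1:
            raise ValueError("max_segments must be at least 1")
        # Merge down to at most this many segments instead of a single one.
        elem.set("maxSegments", str(max_segments))
    return ET.tostring(elem, encoding="unicode")
```

Calling optimize_message(4) yields an optimize element carrying maxSegments="4", while optimize_message() yields a plain full optimize.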

Re: index size before and after commit

2009-10-01 Thread Lance Norskog
I've heard there is a new "partial optimize" feature in Lucene, but it is not mentioned in the Solr or Lucene wikis so I cannot advise you how to use it. On a previous project we had a 500GB index for 450m documents. It took 14 hours to optimize. We found that Solr worked well (given enough RAM fo

Re: index size before and after commit

2009-10-01 Thread Walter Underwood
I've now worked on three different search engines and they all have a 3X worst case on space, so I'm familiar with this case. --wunder On Oct 1, 2009, at 7:15 AM, Mark Miller wrote: Nice one ;) Its not technically a case where optimize requires > 2x though in case the user asking gets confuse

Re: index size before and after commit

2009-10-01 Thread Mark Miller
bq. and reindex without any merges. That's actually quite a hoop to jump through as well - though if you're determined and you have tons of RAM, it's somewhat doable. Mark Miller wrote: > Nice one ;) It's not technically a case where optimize requires > 2x > though in case the user asking gets confused. It's a

Re: index size before and after commit

2009-10-01 Thread Mark Miller
Nice one ;) It's not technically a case where optimize requires > 2x though, in case the user asking gets confused. It's a case unrelated to optimize that can grow your index. Then you need < 2x for the optimize, since you won't copy the deletes. It also requires that you jump through hoops to delete everyth

Re: index size before and after commit

2009-10-01 Thread Walter Underwood
Here is how you need 3X. First, index everything and optimize. Then delete everything and reindex without any merges. You have one full-size index containing only deleted docs, one full-size index containing reindexed docs, and need that much space for a third index. Honestly, disk is che
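The sizing arithmetic behind the 2X and 3X figures in this thread can be written out as a small sketch. The helper names are hypothetical, and the multiplier is simply the worst-case number of full-size index copies you choose to plan for (2 in the usual optimize case, 3 in the scenario above).

```python
def peak_disk_gb(index_gb, copies):
    """Disk needed if `copies` full-size versions of the index coexist."""
    if index_gb < 0 or copies < 1:
        raise ValueError("index_gb must be >= 0 and copies >= 1")
    return index_gb * copies


def max_index_gb(disk_gb, copies):
    """Largest index that still fits when planning for `copies` full-size versions."""
    if disk_gb < 0 or copies < 1:
        raise ValueError("disk_gb must be >= 0 and copies >= 1")
    return disk_gb / copies
```

With the 400GB of disk mentioned elsewhere in the thread, planning for 2 copies allows a 200GB shard, while planning for 3 copies allows only about 133GB.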

Re: index size before and after commit

2009-10-01 Thread Mark Miller
Whoops - the way I have mail come in, it's not easy to tell if I'm replying to the Lucene or Solr list ;) The way Solr works with Searchers and reopen, it shouldn't run into a situation that requires greater than 2x to optimize. I won't guarantee it ;) But based on what I know, it shouldn't happen under n

Re: index size before and after commit

2009-10-01 Thread Mark Miller
Phillip Farber wrote: > I am trying to automate a build process that adds documents to 10 > shards over 5 machines and need to limit the size of a shard to no > more than 200GB because I only have 400GB of disk available to > optimize a given shard. > > Why does the size (du) of an index typically

Re: index size before and after commit

2009-10-01 Thread Grant Ingersoll
It may take some time before resources are released and garbage collected, so that may be part of the reason why things hang around and du doesn't report much of a drop. On Oct 1, 2009, at 8:54 AM, Phillip Farber wrote: I am trying to automate a build process that adds documents to 10 shar

Re: Index size concerns

2009-05-26 Thread Muhammed Sameer
Thank you Otis, I will for sure check on this wa salaam, Muhammed Sameer --- On Tue, 5/26/09, Otis Gospodnetic wrote: > From: Otis Gospodnetic > Subject: Re: Index size concerns > To: solr-user@lucene.apache.org > Date: Tuesday, May 26, 2009, 1:01 PM > > Muhammed, >

Re: Index size concerns

2009-05-26 Thread Otis Gospodnetic
: solr-user@lucene.apache.org > Sent: Monday, May 25, 2009 1:22:15 PM > Subject: Re: Index size concerns > > > Salaam, > > Sorry for this here is the big picture > > Actually we use solr to index all the mails that come to us so that we can > allow > for faster loo

Re: Index size concerns

2009-05-25 Thread Muhammed Sameer
conveying the problem What I wanted to know is that is this index size normal ? Regards, Muhammed Sameer --- On Mon, 5/25/09, Shalin Shekhar Mangar wrote: > From: Shalin Shekhar Mangar > Subject: Re: Index size concerns > To: solr-user@lucene.apache.org > Date: Monday, May 25, 2009, 11:19

Re: Index size concerns

2009-05-25 Thread Shalin Shekhar Mangar
On Mon, May 25, 2009 at 3:53 PM, Muhammed Sameer wrote: > > We are using apache-solr to index our files for faster searches, all things > happen without a problem, my only concern is the size of the cache. > > It seems that the trend is that if I cache 1 GB of files the index goes > to 800MB i

Re: index size tripled during optimization

2009-01-28 Thread Shalin Shekhar Mangar
Does your index stay at triple size after optimization? It is normal for Lucene to use 2x or up to 3x disk space during optimization, but it should fall back to the normal numbers once optimization completes and unused segments are cleaned up by the index deletion policy. If you search for threads i

Re: index size tripled during optimization

2009-01-28 Thread Qingdi
Hi Ryuuichi, Thanks for your quick reply. I checked the setting of <useCompoundFile> in solrconfig.xml, and the value is 'false'. Here is what is in our solrconfig.xml. === false 1000 1 2147483647 10 1000

Re: index size tripled during optimization

2009-01-28 Thread Ryuuichi KUMAI
Hello Qingdi, Have you changed the "<useCompoundFile>" setting in solrconfig.xml? In my experience, when using a compound-file index ("<useCompoundFile>true</useCompoundFile>"), the size of the index grows up to triple during optimization. My understanding is that when writing a new segment in compound format, Lucene writes the multifile format first and

Re: Index size vs. number of documents

2008-08-15 Thread Otis Gospodnetic
apache.org > Sent: Friday, August 15, 2008 12:22:30 PM > Subject: Re: Index size vs. number of documents > > By "Index size almost never grows linearly with the number of > documents" are you saying it increases more slowly that the number of > documents, i.e. sub-line

Re: Index size vs. number of documents

2008-08-15 Thread Phillip Farber
By "Index size almost never grows linearly with the number of documents" are you saying it increases more slowly than the number of documents, i.e. sub-linearly, or more rapidly? With dirty OCR the number of unique terms is always increasing due to the garbage "words" -Phil Chris Hostetter w

Re: Index size vs. number of documents

2008-08-14 Thread Chris Hostetter
: > I'm surprised, as you are, by the non-linearity. Out of curiosity, what is Unless the data in "stored" fields is significantly greater than "indexed" fields the index size almost never grows linearly with the number of documents -- it's the number of unique terms that tends to primarily in

Re: Index size vs. number of documents

2008-08-14 Thread Phillip Farber
Erick Erickson wrote: I'm surprised, as you are, by the non-linearity. Out of curiosity, what is your MaxFieldLength? By default only the first 10,000 tokens are added to a field per document. If you haven't set this higher, that could account for it. We set it to a very large number so we in

Re: Index size vs. number of documents

2008-08-13 Thread Erick Erickson
I'm surprised, as you are, by the non-linearity. Out of curiosity, what is your MaxFieldLength? By default only the first 10,000 tokens are added to a field per document. If you haven't set this higher, that could account for it. As far as I know, optimization shouldn't really affect the index siz

Re: index size

2007-10-11 Thread Kevin Lewandowski
> To achieve this I have to keep the document field to "stored" right? Yes, the field needs to be stored to return snippets. > When I do this my index becomes huge 10 GB index, cause I have 10K > docs but each is very lengthy HTML. Is there any better solution? > Why is index created by nutch s

Re: index size

2007-10-11 Thread Ravish Bhagdev
Hi All, I'm facing similar problem. I want to index entire document as a field. But I also want to be able to retrieve snippets (like Google/Nutch return in results page below the links). To achieve this I have to keep the document field to "stored" right? When I do this my index becomes huge 1

Re: index size

2007-10-09 Thread Kevin Lewandowski
Late reply on this but I just wanted to say thanks for the suggestions. I went through my whole schema and was storing things that didn't need to be stored and indexing a lot of things that didn't need to be indexed. Just completed a full reindex and it's a much more reasonable size now. Kevin On

Re: index size

2007-08-20 Thread Mike Klaas
On 17-Aug-07, at 2:03 PM, Kevin Lewandowski wrote: Are there any tips on reducing the index size or what factors most impact index size? My index has 2.7 million documents and is 200 gigabytes and growing. Most documents are around 2-3kb and there are about 30 indexed fields. An "ls -sh" wil

Re: index size

2007-08-17 Thread Chris Hostetter
: - make sure that you only store fields you need to retrieve... if you : only need to search on the fields, make them indexed-only. and omitNorms on any fields where you don't need length normalization or field boosts (i.e. date fields, numeric fields, boolean flag fields, etc...) -Hoss

Re: index size

2007-08-17 Thread Yonik Seeley
On 8/17/07, Kevin Lewandowski <[EMAIL PROTECTED]> wrote: > Are there any tips on reducing the index size or what factors most > impact index size? > > My index has 2.7 million documents and is 200 gigabytes and growing. > Most documents are around 2-3kb and there are about 30 indexed fields. Wow,