problem with index size

2015-07-22 Thread Daniel Holmes
Hi all,
I have a problem with index size in Solr 4.7.2. My OS is Ubuntu 14.10 64-bit.
My fields are:

In one case for instance my segments size is 8.4G while index size is
28G!!! It seems unusual...

What suggestions do you have to reduce index size?
Is there any way to check disk usage details in cores? e.g. stop words,
stored docs, etc.
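
One rough way to get at the disk-usage question is to group the files in a core's index directory by Lucene file extension, since each extension holds a different part of the index. The sketch below is only an illustration: the extension table reflects my understanding of the Lucene 4.x file formats and is not exhaustive, and the function names `usage_by_extension` and `format_report` are made up for this example.

```python
import os
from collections import defaultdict

# Assumed mapping of common Lucene 4.x file extensions to their contents.
# Not exhaustive; anything missing here is reported as "other".
LUCENE_EXTENSIONS = {
    ".fdt": "stored field data",
    ".fdx": "stored field index",
    ".tim": "term dictionary",
    ".tip": "term index",
    ".doc": "postings: document ids and frequencies",
    ".pos": "postings: positions",
    ".pay": "postings: payloads and offsets",
    ".nvd": "norms data",
    ".nvm": "norms metadata",
    ".dvd": "doc values data",
    ".dvm": "doc values metadata",
    ".tvd": "term vector data",
    ".tvx": "term vector index",
    ".cfs": "compound file (several of the above packed together)",
    ".cfe": "compound file entry table",
}


def usage_by_extension(index_dir):
    """Sum the sizes (in bytes) of the files in index_dir, per extension."""
    totals = defaultdict(int)
    for name in os.listdir(index_dir):
        path = os.path.join(index_dir, name)
        if os.path.isfile(path):
            # Files without an extension (e.g. segments_N) are keyed by name.
            ext = os.path.splitext(name)[1] or name
            totals[ext] += os.path.getsize(path)
    return dict(totals)


def format_report(totals):
    """Render the totals as text lines, largest extension first."""
    grand_total = sum(totals.values())
    lines = []
    for ext, size in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        share = 100.0 * size / grand_total if grand_total else 0.0
        what = LUCENE_EXTENSIONS.get(ext, "other")
        lines.append(f"{ext:<14}{size:>16,d} bytes {share:5.1f}%  {what}")
    return "\n".join(lines)
```

It could be called as `print(format_report(usage_by_extension("/path/to/core/data/index")))`, where the path is a placeholder for the core's actual index directory. A large `.fdt` share would point at stored fields, while large `.tim`, `.doc` and `.pos` shares would point at the indexed terms and postings.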


Re: problem with index size

2015-07-22 Thread Daniel Holmes
Upayavira, the number of docs in that case is 140275. The Solr memory is 30 GB.

Yes Emir, I need most of them to be saved.

Alessandro, I don't know whether it is usual for indexing to use more than 3x
the document size on disk, and presumably it will keep growing exponentially
as the crawl continues... It's so suboptimal, I think.


On Wed, Jul 22, 2015 at 3:16 PM, Alessandro Benedetti <
benedetti.ale...@gmail.com> wrote:

> "In one case for instance my segments size is 8.4G while index size is
> 28G!!! It seems unusual…"
>
> The index is a collection of index segments plus a little overhead.
> So, do you simply mean you have 4 segments?
> Where is the problem, anyway?
> You are also storing content, which is usually a big part of the index.
> As Upaya said, I am curious to know why you are so surprised!
>
> Cheers
>
> 2015-07-22 11:27 GMT+01:00 Daniel Holmes :
>
> > Hi all,
> > I have a problem with index size in Solr 4.7.2. My OS is Ubuntu 14.10 64-bit.
> > My fields are:
> >
> > In one case for instance my segments size is 8.4G while index size is
> > 28G!!! It seems unusual...
> >
> > What suggestions do you have to reduce index size?
> > Is there any way to check disk usage details in cores? e.g. stop words,
> > stored docs, etc.
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


What kind of nutch documents does Solr index?

2015-09-28 Thread Daniel Holmes
Hi,
I am using Apache Nutch 1.7 to crawl and Apache Solr 4.7.2 for indexing. In
my tests there is a gap between the number of results fetched by Nutch and
the number of documents indexed in Solr. For example, one of the crawls
fetched 23343 pages and 1146 images successfully, while in Solr 19250
docs are indexed and 500 of them are image URLs.

My question is: what kind of pages are indexed in Solr, and why?
Does Solr index pages with other statuses or not?
What kind of images does Solr index?

Thanks.


Re: What kind of nutch documents does Solr index?

2015-09-30 Thread Daniel Holmes
Thank you, Upayavira, for your answer. In the case I described, maxDoc is 19263.
As far as I can tell from checking Nutch, its default indexing filter is the
basic indexing filter, and it also has a property to delete gone and permanently
redirected pages, whose value was false for me.
So I think the problem still remains on the Solr side.
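
Putting the numbers from this thread side by side may make the size of the gap clearer. The sketch below is plain arithmetic, under two assumptions: that the 19250 indexed docs correspond to numDocs, and that maxDoc minus numDocs counts the documents that were deleted or overwritten and not yet merged away (so it is a lower bound). The function name `index_gap` is made up for this example.

```python
def index_gap(fetched, num_docs, max_doc):
    """Break down the difference between fetched and indexed documents."""
    # Deleted or overwritten docs still present in the segments.
    overwritten = max_doc - num_docs
    # Fetched documents that are not searchable in the index.
    missing = fetched - num_docs
    # Part of the gap that overwrites alone do not account for.
    unexplained = missing - overwritten
    return {
        "overwritten": overwritten,
        "missing": missing,
        "unexplained": unexplained,
    }


# Figures quoted in this thread: 23343 pages plus 1146 images fetched,
# 19250 docs indexed (assumed to be numDocs), maxDoc of 19263.
gap = index_gap(fetched=23343 + 1146, num_docs=19250, max_doc=19263)
print(gap)
```

With these figures only 13 documents show up as overwritten, while 5239 fetched documents are missing from the index, so duplicate IDs would explain very little of the difference unless segment merges had already purged earlier deletes.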


On Mon, Sep 28, 2015 at 3:03 PM, Upayavira wrote:

> I suspect you may be better off asking this on the Nutch user list. The
> decisions you are describing will be within the Nutch codebase, not
> Solr. Someone here may know (hopefully) but you may get more support
> over on the Nutch list.
>
> One suggestion: start with a clean, empty index. Run a crawl. Look at
> maxDoc vs numDocs (visible via the admin UI for your
> core/collection). If maxDoc > numDocs, it means that some docs have been
> overwritten, i.e. the ID field that Nutch is using is not unique.
>
> Upayavira
>
> On Mon, Sep 28, 2015, at 10:19 AM, Daniel Holmes wrote:
> > Hi,
> > I am using Apache Nutch 1.7 to crawl and Apache Solr 4.7.2 for indexing.
> > In my tests there is a gap between the number of results fetched by Nutch
> > and the number of documents indexed in Solr. For example, one of the crawls
> > fetched 23343 pages and 1146 images successfully, while in Solr 19250
> > docs are indexed and 500 of them are image URLs.
> >
> > My question is: what kind of pages are indexed in Solr, and why?
> > Does Solr index pages with other statuses or not?
> > What kind of images does Solr index?
> >
> > Thanks.
>