IndexReaders cannot exceed 2 Billion

2017-08-07 Thread Wael Kader
Hello,

I'm facing an issue that is driving me crazy.
I am running Solr with the index stored on HDFS, in a single-node setup. The
index had been running fine until today.
I know that 2 billion documents is too many for a single node, but it had
been meeting my requirements and was pretty fast.

I restarted Solr today and I am getting an error stating "Too many
documents, composite IndexReaders cannot exceed 2147483519".
The last backup I have is two weeks old, and I really need the index to come
back up so I can recover the data from it.

Please help!
-- 
Regards,
Wael


IndexReaders cannot exceed 2 Billion

2017-08-08 Thread Wael Kader
> 
> Hello,
> 
> I am facing an issue in my live environment and I couldn't find a solution 
> yet.
> I am running Solr with the index stored on HDFS, in a single-node setup. The 
> index had been running fine until today. 
> I know that 2 billion documents is too many for a single node, but it had 
> been meeting my requirements and was pretty fast.
> 
> I restarted Solr today and I am getting an error stating "Too many 
> documents, composite IndexReaders cannot exceed 2147483519".
> The last backup I have is two weeks old, and I really need the index to come 
> back up so I can recover the data from it. I can delete data and create a 
> separate shard, but the index needs to be up before I can extract the data.
> 
> Please help !
> -- 
> Regards,
> Wael
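For context, the number in the error is Lucene's hard per-index document cap
(`IndexWriter.MAX_DOCS`), which is Java's `Integer.MAX_VALUE` minus 128 values
Lucene reserves internally. A quick check of the arithmetic:

```python
# Lucene refuses to open any index holding more than IndexWriter.MAX_DOCS
# documents; the constant is Integer.MAX_VALUE - 128, where the last 128
# document IDs are reserved for Lucene's internal use. This is exactly the
# figure quoted in the "composite IndexReaders cannot exceed" error.
JAVA_INTEGER_MAX_VALUE = 2**31 - 1          # 2147483647
LUCENE_MAX_DOCS = JAVA_INTEGER_MAX_VALUE - 128

print(LUCENE_MAX_DOCS)
```

Because the cap applies per Lucene index (per core), the usual way past it is
splitting the collection into multiple shards so no single core approaches the
limit.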



Could not find configName error

2017-09-04 Thread Wael Kader
Hi,

I had an issue with Solr shutting down on a single-node deployment on
Hadoop.

After starting it back up I got the error:
Could not find configName for collection XXX found.

I know the problem is that the configs in ZooKeeper are broken, but I would
like to know how I can push the configuration back up to get the index
running again.

-- 
Regards,
Wael


Re: Could not find configName error

2017-09-05 Thread Wael Kader
I am using Solr 4.10.3.

I am not sure I have them in source control; I don't actually know what
that is. I am using Solr on a pre-configured VM.

On Tue, Sep 5, 2017 at 5:26 PM, Erick Erickson 
wrote:

> What version of Solr?
>
> bin/solr zk -help
>
> In particular upconfig can be used to move configsets up to Zookeeper
> (or back down, or whatever) in relatively recent versions of Solr. You
> are keeping them in source control, right? ;)
>
> Best,
> Erick
>



-- 
Regards,
Wael
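For reference, the `bin/solr zk` subcommands Erick mentions do not exist yet in
Solr 4.10.x; the equivalents are the `zkcli.sh` script that ships with Solr
and, on Cloudera Search, the `solrctl` wrapper. A sketch of both — the
ZooKeeper host, paths, config name, and collection name are all placeholders
to adjust for the actual install:

```shell
# Stock Solr 4.10.x: push a local config directory up to ZooKeeper.
# The script location varies by install (here, the tarball layout).
/opt/solr/example/scripts/cloud-scripts/zkcli.sh \
  -zkhost localhost:2181/solr \
  -cmd upconfig \
  -confdir /path/to/conf \
  -confname myconfig

# Cloudera Search equivalent: update the instance directory, then
# reload the collection so it picks up the restored config.
solrctl instancedir --update myconfig /path/to/conf
solrctl collection --reload mycollection
```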


Faceting Word Count

2017-11-05 Thread Wael Kader
Hello,

I have an index with around 100 million documents.
I have a multivalued field in which I save big chunks of text data. The
server has around 20 GB of RAM and 4 CPUs.

I was faceting on this field to build a word cloud. Retrieval took around 1
second when the index held 5-10 million documents. Now that there is more
data, it takes minutes to get the results (if Solr doesn't crash first).
What's the best way to make this work, or is it simply not scalable with my
current schema and design for news articles?

I am looking for the best solution. Maybe I should create another index to
split the data at insert time, or maybe changing some settings in
solrconfig.xml or adding RAM would make it perform better.

-- 
Regards,
Wael


Re: Faceting Word Count

2017-11-06 Thread Wael Kader
Hi,

I am using a custom field. Below is the field definition (tags partly lost
in the archive; attributes preserved).
I am using this because I don't want stemming.

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer .../>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt"
            generateWordParts="0"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="1"
            preserveOriginal="1"/>
    <!-- remaining filters not recoverable -->
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer .../>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt"
            generateWordParts="0"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="1"
            preserveOriginal="1"/>
    <!-- remaining filters not recoverable -->
  </analyzer>
</fieldType>

Regards,
Wael

On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Wael,
> Can you provide your field definition and a sample query?
>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>


-- 
Regards,
Wael


Re: Faceting Word Count

2017-11-07 Thread Wael Kader
Hi,

The whole index has 100M documents, but with my filter criteria applied the
result set is at most around 10k rows.
Faceting stopped working once the index reached 100M records; it was still
working at 5M.

I have social media and RSS data in the index, and I am trying to get the
word counts for a specific user over specific date intervals.

Regards,
Wael

On Mon, Nov 6, 2017 at 3:42 PM, Erick Erickson 
wrote:

> _Why_ do you want to get the word counts? Faceting on all of the
> tokens for 100M docs isn't something Solr is ordinarily used for. As
> Emir says it'll take a huge amount of memory. You can use one of the
> function queries (termfreq IIRC) that will give you the count of any
> individual term you have and will be very fast.
>
> But getting all of the word counts in the index is probably not
> something I'd use Solr for.
>
> This may be an XY problem, you're asking how to do something specific
> (X) without explaining what the problem you're trying to solve is (Y).
> Perhaps there's another way to accomplish (Y) if we knew more about
> what it is.
>
> Best,
> Erick
>
>
>
> On Mon, Nov 6, 2017 at 4:15 AM, Emir Arnautović
>  wrote:
> > Hi Wael,
> > You are faceting on an analyzed field. This results in the field being
> > uninverted - fieldValueCache being built - on the first call after every
> > commit. This is both time- and memory-consuming (you can check in the
> > admin console stats how much memory it took).
> > What you need to do is create a multivalued string field (not text),
> > parse the values (do the analysis steps) on the client side, and store
> > them like that. This will allow you to enable docValues on that field
> > and avoid building fieldValueCache.
> >
> > HTH,
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/



-- 
Regards,
Wael
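Emir's suggestion quoted above amounts to running the analysis on the client
and indexing the resulting tokens into a plain multivalued string field with
docValues, so faceting no longer has to uninvert a full-text field. A minimal
client-side sketch — the stopword set and the accent folding here are
placeholders standing in for whatever stopwords.txt and
mapping-ISOLatin1Accent.txt do on the server:

```python
import re
import unicodedata

# Placeholder for the stopwords.txt used by the Solr analyzer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def analyze(text):
    """Rough client-side stand-in for the index analyzer:
    fold accents, lowercase, split on non-alphanumerics, drop stopwords."""
    # Fold accented characters to ASCII (ISOLatin1Accent-style mapping).
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    # Lowercase and keep runs of letters/digits as tokens.
    tokens = re.findall(r"[a-z0-9]+", folded.lower())
    return [t for t in tokens if t not in STOPWORDS]

# The returned list would be indexed into a multivalued string field
# (e.g. a hypothetical words_ss field with docValues="true"), which Solr
# can facet on without building fieldValueCache.
print(analyze("Café prices rose, and the café owners protested"))
```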


Re: Faceting Word Count

2017-11-08 Thread Wael Kader
Hi,

I want to know the best option for building a word cloud in Solr.
Is it saving the data as a multivalued field, using term vectors, or JSON
faceting (which didn't work for me)? The Terms component doesn't work
because I can't apply any filter criteria.

I don't mind changing the design, but I need the most feasible approach that
won't cause problems in the long run.
I want to get word frequencies restricted by a filter; facets currently take
around a minute to return the data.

Regards,
Wael

On Wed, Nov 8, 2017 at 11:06 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Wael,
> You can try out JSON faceting - it’s not just about rq/resp format, but it
> uses different implementation as well. In any case you will have to index
> documents differently in order to be able to use docValues.
>
> HTH
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
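On the JSON-faceting option Emir mentions: the request can be expressed as a
JSON body POSTed to the collection's /query endpoint. A sketch, assuming the
pre-tokenized docValues field suggested earlier in the thread exists (called
words_ss here, a made-up name) along with hypothetical user/date fields; note
that the JSON Facet API arrived in Solr 5.1, so it is not available on 4.10.x
without an upgrade:

```python
import json

# JSON Facet API request: term counts (word cloud) over a hypothetical
# multivalued string field "words_ss" with docValues enabled, restricted
# to one user and a date interval via filter queries so only the matching
# ~10k documents are faceted.
request_body = {
    "query": "*:*",
    "filter": [
        "user_id:12345",                   # assumed field names
        "published_date:[2017-10-01T00:00:00Z TO 2017-11-01T00:00:00Z]",
    ],
    "limit": 0,                            # no documents needed, only facets
    "facet": {
        "word_cloud": {
            "type": "terms",
            "field": "words_ss",
            "limit": 100,                  # top 100 words
        }
    },
}

# POSTed to /solr/<collection>/query, e.g. with curl or a Solr client.
print(json.dumps(request_body, indent=2))
```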

SOLR Data Backup

2018-01-18 Thread Wael Kader
Hello,

What's the best way to back up Solr data?
I have a single-node Solr server and I want to always keep a copy of the
data.

Is replication an option for what I want?

I would also appreciate pointers to tutorials or papers on whichever method
is appropriate, whether that's backup, replication, or something else.

-- 
Regards,
Wael


Re: SOLR Data Backup

2018-01-18 Thread Wael Kader
Hi,

It's not possible for me to re-index: the data in some of my indexes is
only stored in Solr.
I need a solution that ensures that if the live index fails, I can switch to
the backup or replicated index.

Thanks,
Wael

On Thu, Jan 18, 2018 at 11:41 AM, Charlie Hull  wrote:

> On 18/01/2018 09:21, Wael Kader wrote:
>
>> Hello,
>>
>> Whats the best way to do a backup of the SOLR data.
>> I have a single node solr server and I want to always keep a copy of the
>> data I have.
>>
>> Is replication an option for what I want ?
>>
>> I would like to get some tutorials and papers if possible on the method
>> that should be used in case its backup or replication or anything else.
>>
>>
> Hi Wael,
>
> Have you considered backing up the source data instead? You can always
> re-index to re-create the Solr data.
>
> Replication will certainly allow you to maintain a copy of the Solr data,
> either so you can handle more search traffic by load balancing between the
> two, or to provide a failover capability in the case of a server failure.
> But this isn't a backup in the traditional sense. You shouldn't consider
> Solr as your 'source of truth' unless for some reason it is impossible to
> re-index.
>
> Perhaps if you could let us know why you think you need a backup we can
> suggest the best solution.
>
> Cheers
>
> Charlie
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>



-- 
Regards,
Wael


Re: SOLR Data Backup

2018-01-18 Thread Wael Kader
Hi,

The data is always changing, so I think I will try the replication option.
I am using Cloudera and the data is stored in HDFS. Is it possible to move
the data while the index is running, without any problems?

I would also like to know whether it's possible to set up master/slave
replication without rebuilding the index.

Thanks,
Wael




-- 
Regards,
Wael
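On the master/slave question: in pre-SolrCloud Solr 4.x, replication is
configured through the ReplicationHandler in solrconfig.xml, and a newly
configured slave pulls the master's existing index on its first poll, so
nothing needs rebuilding. A sketch with hypothetical host names, core name,
and intervals — note the stock handler replicates local indexes, so an
HDFS-backed setup may need extra care:

```xml
<!-- Master (solrconfig.xml on the live node) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- Slave (solrconfig.xml on the backup node) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/collection1</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>
```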


Solr Recommended setup

2018-02-14 Thread Wael Kader
Hi,

I would like a recommendation for my Solr setup.

The index receives around 2 million records per day and runs in Cloudera
Search (Solr). Everything runs on one node, and I commit whatever data has
arrived every 5 minutes.
The Cloudera VM has 64 GB of RAM.

It has been working fine so far with around 80 million records, but Solr
slows down about once a week, and I restart the VM to get things working
again.
I would like a recommendation on this setup; note that I can add VMs if
needed.
I have read that it's wrong to index and read from the same place. I am
doing that now, and I know it's wrong.
How can I set Solr up on Cloudera so that indexing happens on one VM and
reading on another, and what else would you recommend for my setup?


-- 
Regards,
Wael
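One common tuning point for a setup like this is letting Solr manage commits
instead of issuing an explicit commit every 5 minutes: a hard autoCommit with
openSearcher=false for durability, plus a soft commit for visibility, both in
solrconfig.xml. The intervals below are illustrative, not prescriptive:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flush to stable storage, don't open a new searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>           <!-- 60 s; illustrative -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: make newly indexed documents visible to searches -->
  <autoSoftCommit>
    <maxTime>300000</maxTime>          <!-- 5 min, matching the current cadence -->
  </autoSoftCommit>
</updateHandler>
```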


Solr crashing StandardWrapperValve

2018-02-27 Thread Wael Kader
Hello,

Solr kept crashing today, over and over again.
I am running a single-node Solr instance on Cloudera with 140 GB of data.
Things were working fine until today. I have a replication server that I
replicate data to; it hadn't been working and was fixed today, so I thought
it might be causing the issue and stopped the replication.
I am not sure that is the problem, though, as Solr crashed once more after I
stopped the replication. I need help identifying the problem.

In the log I found the error below:

Feb 27, 2018 6:23:14 AM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet default threw exception
java.lang.IllegalStateException
    at org.apache.catalina.connector.ResponseFacade.sendError(ResponseFacade.java:407)
    at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:962)
    at org.apache.solr.servlet.SolrDispatchFilter.httpSolrCall(SolrDispatchFilter.java:497)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:255)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.solr.servlet.SolrHadoopAuthenticationFilter$2.doFilter(SolrHadoopAuthenticationFilter.java:408)
    at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:622)
    at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:301)
    at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:574)
    at org.apache.solr.servlet.SolrHadoopAuthenticationFilter.doFilter(SolrHadoopAuthenticationFilter.java:413)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:612)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:503)
    at java.lang.Thread.run(Thread.java:745)

-- 
Regards,
Wael


Move SOLR from cloudera HDFS to SOLR on Docker

2019-12-18 Thread Wael Kader
Hello,

I want to move the data from my Solr setup on Cloudera Hadoop to a Solr
container in Docker.
I don't need to run all of the Hadoop services in my setup, as Solr is the
only part of the Cloudera distribution I currently use.

My concern is the best way to move the data and schema into the Docker
container.
I don't mind moving the data into an older Solr container image to match the
Solr 4.10.3 version I have on Cloudera.

Much help is appreciated.

-- 
Regards,
Wael
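One possible outline for such a move is to copy the index out of HDFS and into
the container's core data directory. Every path, image tag, and core name
below is an assumption, not a recipe — in particular, official Solr Docker
images only go back to 5.x, so a 4.10.3 image would have to be built locally:

```shell
# 1. Stop indexing, issue a final commit, then copy the index out of HDFS.
#    The HDFS path is an assumption -- check solr.hdfs.home in your setup.
hadoop fs -copyToLocal /solr/collection1/core_node1/data ./collection1-data

# 2. Start a Solr container (image tag is hypothetical; a 4.10.3 image
#    would need to be built, since official images start at 5.x).
docker run -d --name solr4 my-solr:4.10.3

# 3. Copy in the config (schema.xml and solrconfig.xml, with the HDFS
#    DirectoryFactory settings removed) and the index data, then restart
#    so the core loads the copied index.
docker cp ./conf solr4:/opt/solr/example/solr/collection1/conf
docker cp ./collection1-data/index \
          solr4:/opt/solr/example/solr/collection1/data/index
docker restart solr4
```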