Re: Explode kind of function in Solr

2018-09-14 Thread Rahul Singh
https://github.com/bazaarvoice/jolt On Thu, Sep 13, 2018 at 9:18 AM Joel Bernstein wrote: > Solr Streaming Expressions allow you to do this with the cartesianProduct > function: > > > http://lucene.apache.org/solr/guide/7_4/stream-decorator-reference.html#cartesianproduct > > The structure of th

Re: 20180913 - Clarification about Limitation

2018-09-13 Thread Rahul Singh
Depends on whether you are using Solr or SolrCloud. SolrCloud distributes data into shards, so it increases overall capacity. Rahul Singh Chief Executive Officer m 202.905.2818 Anant Corporation 1010 Wisconsin Ave NW, Suite 250 Washington, D.C. 20007 We build and manage digital business

Re: parent/child rows in solr

2018-09-13 Thread Rahul Singh
waste of space. Rahul Singh Chief Executive Officer m 202.905.2818 Anant Corporation 1010 Wisconsin Ave NW, Suite 250 Washington, D.C. 20007 We build and manage digital business technology platforms. On Sep 11, 2018, 11:23 PM -0400, John Smith , wrote: > On Tue, Sep 11, 2018 at 11:05 PM Wal

Re: Boost only first 10 records

2018-09-03 Thread Rahul Singh
” query. Rahul Singh Chief Executive Officer m 202.905.2818 Anant Corporation 1010 Wisconsin Ave NW, Suite 250 Washington, D.C. 20007 We build and manage digital business technology platforms. On Sep 3, 2018, 6:29 AM -0400, Emir Arnautović , wrote: > Hi, > The requirement is not 100% cl

Re: Metrics for a healthy Solr cluster

2018-08-17 Thread Rahul Singh
I wrote something related to this topic a while ago. https://www.google.com/amp/s/blog.anant.us/resources-for-monitoring-datastax-cassandra-spark-solr-performance/amp/ Rahul On Aug 16, 2018, 3:35 PM -0700, Jan Høydahl , wrote: > Check out the Reference Guide chapter on monitoring with open source

Re: Recipe for moving to solr cloud without reindexing

2018-08-07 Thread Rahul Singh
Bjarke, I am imagining that at some point you may need to shard that data if it grows. Or do you imagine this data to remain stagnant? Generally you want to add SolrCloud to do three things: 1. Increase availability with replicas 2. Increase available data via shards 3. Increase fault tolerance

Re: Silk from LucidWorks

2018-07-15 Thread Rahul Singh
Their commercial offering still has something like it. You can always try Grafana. Rahul On Jul 13, 2018, 9:59 AM -0400, rgummadi , wrote: > Is SiLK from LucidWorks still an active project? I looked at their github and > it does not seem to be active. If so are there any alternative solutions. > >

Re: Text Similarity

2018-07-15 Thread Rahul Singh
How do you define similarity? There are various methods that suit different cases. In Solr, depending on which index-time analyzer / tokenizer you are using, it will treat one company name as similar in one scenario and not in another. This seems like a case of data deduplication

Re: Delta import not working with Oracle in Solr

2018-07-10 Thread Rahul Singh
Agreed. DIH is not an industrial-grade ETL tool, so you may want to consider other options. You may want to look into Kafka Connect as an alternative. It has connectors for JDBC into Kafka, and from Kafka into Solr. -- Rahul Singh rahul.si...@anant.us Anant Corporation On Jul 9, 2018, 6:14 AM -0500

Re: How to know the name(url) of documents that data import handler skipped

2018-07-08 Thread Rahul Singh
Have you tried changing the log level https://lucene.apache.org/solr/guide/7_2/configuring-logging.html -- Rahul Singh rahul.si...@anant.us Anant Corporation On Jul 8, 2018, 8:54 PM -0500, Yasufumi Mizoguchi , wrote: > Hi, > > I am trying to indexing files into Solr 7.2 using da

Resources for Monitoring Cassandra, Spark, Solr

2018-07-02 Thread Rahul Singh
is a work in progress and I'll update this with screenshots as well as with links from other contributors. -- Rahul Singh rahul.si...@anant.us Anant Corporation

Re: Drive Change for Solr Setup

2018-06-21 Thread Rahul Singh
If it’s Windows, it may be using a tool called NSSM to manage the Solr service. Look at Windows services and Task Scheduler to understand whether the Solr services are being managed by Windows via services or Task Scheduler, or just by batch files. Rahul On Jun 20, 2018, 11:34 AM -0400, Shawn Heisey

Re: Solr Cloud 7.3.1 backups

2018-05-31 Thread Rahul Singh
are some decent distributed shared file system services that could be leveraged depending on the number of compute nodes. A shared file system is the best way to keep it consistent, but it comes with its drawbacks. You can always back up locally and asynchronously sync to the shared FS too. -- Rahul

Re: How to do parallel indexing on files (not on HDFS)

2018-05-24 Thread Rahul Singh
Right, that’s why you need a place to persist the task list / graph. If you use a table, you can set a “processed” / “unprocessed” value … or use a queue, so that it’s delivered only once. Otherwise you have to check the indexed date from Solr and waste a Solr call. -- Rahul Singh rahul.si...@anant.us
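The table-based option mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: SQLite stands in for whatever database is actually used, and the table name `tasks`, the extra `processing` status, and the helper names are hypothetical, not something from the thread. It also assumes a single worker; claiming rows safely from several workers would need a transaction or row lock.

```python
import sqlite3
from typing import Iterable, Optional


def init_tasks(conn: sqlite3.Connection, paths: Iterable[str]) -> None:
    """Persist the task list; every new path starts as 'unprocessed'."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "path TEXT PRIMARY KEY, "
        "status TEXT NOT NULL DEFAULT 'unprocessed')"
    )
    # INSERT OR IGNORE keeps the status of paths that were already recorded.
    conn.executemany(
        "INSERT OR IGNORE INTO tasks (path) VALUES (?)", [(p,) for p in paths]
    )
    conn.commit()


def claim_next(conn: sqlite3.Connection) -> Optional[str]:
    """Return the next unprocessed path (single-worker assumption), or None."""
    row = conn.execute(
        "SELECT path FROM tasks WHERE status = 'unprocessed' "
        "ORDER BY path LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE tasks SET status = 'processing' WHERE path = ?", (row[0],))
    conn.commit()
    return row[0]


def mark_processed(conn: sqlite3.Connection, path: str) -> None:
    """Record that a file has been indexed, so no Solr lookup is needed later."""
    conn.execute("UPDATE tasks SET status = 'processed' WHERE path = ?", (path,))
    conn.commit()
```

A worker would loop on `claim_next`, index the file, then call `mark_processed`; the status column replaces the per-file "has this been indexed?" query against Solr.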

Re: How to do parallel indexing on files (not on HDFS)

2018-05-24 Thread Rahul Singh
aring more from you or anyone in this > Solr community. > > > > Sincerely yours, > > > Raymond > > > On Wed, May 23, 2018 at 8:15 AM, Rahul Singh > > wrote: > > > Enumerate the file locations (map

Re: How to do parallel indexing on files (not on HDFS)

2018-05-23 Thread Rahul Singh
Enumerate the file locations (map) , put them in a queue like rabbit or Kafka (Persist the map), have a bunch of threads , workers, containers, whatever pop off the queue , process the item (reduce). -- Rahul Singh rahul.si...@anant.us Anant Corporation On May 20, 2018, 7:24 AM -0400
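The map / queue / workers pattern described above might look something like this minimal sketch. It makes several assumptions: an in-memory `queue.Queue` stands in for RabbitMQ or Kafka (so nothing is actually persisted here), threads stand in for workers or containers, and `handler` is a hypothetical callback that would parse a file and send it to Solr.

```python
import queue
import threading
from pathlib import Path
from typing import Callable, Iterable, List


def enumerate_files(root: str) -> List[str]:
    """Map step: list every regular file under root."""
    return sorted(str(p) for p in Path(root).rglob("*") if p.is_file())


def process_all(
    paths: Iterable[str], handler: Callable[[str], None], workers: int = 4
) -> int:
    """Reduce step: workers pop paths off a shared queue and call handler.

    Returns the number of paths handled without an exception.
    """
    tasks: "queue.Queue[str]" = queue.Queue()
    for path in paths:
        tasks.put(path)

    done = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal done
        while True:
            try:
                path = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            try:
                handler(path)
                with lock:
                    done += 1
            except Exception:
                # A real broker would re-deliver or dead-letter the item.
                pass

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Usage would be along the lines of `process_all(enumerate_files("/data/docs"), index_file)`, where `index_file` is whatever function posts one document to Solr.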

Re: Multi threading indexing

2018-05-16 Thread Rahul Singh
Can try to leverage Spark to index. Or Kafka Connect with SolR. -- Rahul Singh rahul.si...@anant.us Anant Corporation On May 14, 2018, 2:03 AM -0500, Mikhail Khludnev , wrote: > A few years ago I provided server side concurrency "booster" > https://issues.apache.org/jira/browse/

Re: SolrCloud

2018-05-16 Thread Rahul Singh
Having concurrent DIH for example from the same source on different cluster nodes may cause duplicate work. But yes the ZK is what distributes the conf. -- Rahul Singh rahul.si...@anant.us Anant Corporation On May 16, 2018, 4:55 AM -0500, Jon Morisi , wrote: > Hi All, > I'm

Re: Apache SOLR Design Query

2018-05-13 Thread Rahul Singh
. 4. Unless you need highlighting, only index the actual contents, and store the rest of the fields. 5. Shared file storage is probably OK, but you may want to add a caching layer via Nginx and serve files through it. That way you don’t hit the disk every time. -- Rahul Singh rahul.si

Re: Team please help

2018-04-29 Thread Rahul Singh
pipeline. Best, -- Rahul Singh rahul.si...@anant.us Anant Corporation On Apr 29, 2018, 6:27 AM -0700, Doug Turnbull , wrote: > Morphlines is a cloudera specific tool. I suspect moving Solr platforms > will require you to rework your indexing somewhat. You may need to step > back and think

Re: solr cell: write entire file content binary to index along with metadata

2018-04-25 Thread Rahul Singh
process can improve the overall stability of the SolR service. -- Rahul Singh rahul.si...@anant.us Anant Corporation On Apr 25, 2018, 12:49 PM -0400, Shawn Heisey , wrote: > On 4/25/2018 4:02 AM, Lee Carroll wrote: > > *We don't recommend using solr-cell for production indexing.* >

Re: DIH with huge data

2018-04-12 Thread Rahul Singh
CSV -> Spark -> SolR https://github.com/lucidworks/spark-solr/blob/master/docs/examples/csv.adoc If speed is not an issue there are other methods. Spring Batch / Spring Data might have all the tools you need to get speed without Spark. -- Rahul Singh rahul.si...@anant.us Anant Corpo

Re: DIH with huge data

2018-04-12 Thread Rahul Singh
t by merging distinct RDBMS tables in using RDD? > > On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh wrote: > > > How much data and what is the database source? Spark is probably the > > fastest way. > > > > -- > > Rahul Singh > > rahul.si...@anant.us &

Re: DIH with huge data

2018-04-12 Thread Rahul Singh
How much data and what is the database source? Spark is probably the fastest way. -- Rahul Singh rahul.si...@anant.us Anant Corporation On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar , wrote: > Hi, > > We are using DIH with SortedMapBackedCache but as data size increases we > nee

Re: Text in images are not extracted and indexed to content

2018-04-10 Thread Rahul Singh
May need to extract outside SolR and index pure text with an external ingestion process. You have much more control over the Tika attributes and behaviors. -- Rahul Singh rahul.si...@anant.us Anant Corporation On Apr 9, 2018, 10:23 PM -0400, Zheng Lin Edwin Yeo , wrote: > Hi, > > Cu

Re: Using Solr to build a product matcher, with learning to rank

2018-03-29 Thread Rahul Singh
Maybe overthinking this. There is a “more like this” feature that basically does this. Give that a try before digging deeper into the LTR methods. It may be good enough for rock and roll. -- Rahul Singh rahul.si...@anant.us Anant Corporation On Mar 28, 2018, 12:25 PM -0400, Xavier Schepler

RE: Solr or Elasticsearch

2018-03-22 Thread Rahul Singh
because the updates / selects are fast. Ultimately I think SolR is like an 18-wheel tractor trailer and Elastic is like U-Haul trucks, and you can chain a bunch of them up to do what SolR does. -- Rahul Singh rahul.si...@anant.us Anant Corporation On Mar 22, 2018, 9:04 AM -0500, Liu, Daphne

RE: Question liste solr

2018-03-20 Thread Rahul Singh
Parallel processing in any way will help, including Spark w/ a DFS like S3 or HDFS. Your three machines could end up being a bottleneck and you may need more nodes. On Mar 20, 2018, 2:36 AM -0500, LOPEZ-CORTES Mariano-ext , wrote: > CSV file is 5GB approx. for 29 millions. > > As you say Christo

Re: Securying ONLY the web interface console

2018-03-19 Thread Rahul Singh
Use a proxy server that only gives access to the update / select handlers (URLs). You can do it with any of numerous programming languages or with a simple proxy in nginx. The whole web server running SolR is not supposed to be out in the open. You are opening yourself up to too many issues. -- Rahul

Re: Looking for design ideas

2018-03-18 Thread Rahul Singh
may be more work but it’s more scalable. Go big or go home. ;) Hope it helps -- Rahul Singh rahul.si...@anant.us Anant Corporation On Mar 18, 2018, 11:14 AM -0400, Steven White , wrote: > Hi everyone, > > I have a design problem that I'm not sure how to solve best so I figured I &

Re: solr.war built from solr 4.7.2 not working

2015-05-07 Thread Rahul Singh
response inline. On Thu, May 7, 2015 at 7:01 PM, Shawn Heisey wrote: > On 5/7/2015 3:43 AM, Rahul Singh wrote: > > I have tried to deploy solr.war from building it from 4.7.2 but it is > > showing the below mentioned error. Has anyone faced the same? any lead > > woul

solr.war built from solr 4.7.2 not working

2015-05-07 Thread Rahul Singh
Hi, I have tried to deploy solr.war from building it from 4.7.2 but it is showing the below mentioned error. Has anyone faced the same? any lead would also be appreciated. Error Message: { "responseHeader": { "status": 500, "QTime": 33 }, "error": { "msg": "parsing error",

Re: ranking retrieval measure

2014-04-01 Thread Rahul Singh
One of the measurement criteria is DCG. http://en.wikipedia.org/wiki/Discounted_cumulative_gain On Tue, Apr 1, 2014 at 11:44 AM, Floyd Wu wrote: > Usually IR system is measured using Precision & Recall. > But depends on what kind of system you are developing to fit what scenario. > > Take a lo
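For reference, DCG as defined on the linked Wikipedia page can be computed with a few lines of code. The sketch below uses the common form in which the relevance at rank i is divided by log2(i + 1); the function names and the choice to return 0.0 for an all-zero ranking are illustrative assumptions, not part of the thread.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """Discounted cumulative gain over the first k results (all if k is None).

    enumerate() is zero-based, so rank i + 1 is discounted by log2(i + 2).
    """
    rels = relevances if k is None else relevances[:k]
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))


def ndcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` is the list of graded relevance judgments for the results in the order the system returned them, so a ranking that puts the most relevant documents first scores closest to 1.0 under `ndcg`.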