iii) Wait for 5-10 seconds between each subsequent node start
Hope this helps.
Best,
Rahul
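The staggered start in step iii can be sketched in a few lines of Python. This is only an illustration: `start_node` is a hypothetical callable standing in for however a node is actually launched (service manager, script, orchestration tool), and the 5 second default is simply the low end of the range suggested above.

```python
import time
from typing import Callable, Iterable, List


def staggered_start(
    nodes: Iterable[str],
    start_node: Callable[[str], None],
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Start nodes one at a time, pausing between consecutive starts.

    `start_node` is a hypothetical hook; it is not part of Solr.
    """
    started: List[str] = []
    for index, node in enumerate(nodes):
        if index > 0:
            # Pause before every node except the first one.
            sleep(delay_seconds)
        start_node(node)
        started.append(node)
    return started
```

The `sleep` parameter exists only so the pause can be swapped out in a dry run.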
On Thu, Feb 11, 2021 at 12:03 PM mmb1234 wrote:
> Hello,
>
> On reboot of one of the solr nodes in the cluster, we often see a
> collection's shards with
> 1. LEADER replica in DO
t on underscores if that is your use case.
>
> On Sat, Jan 9, 2021 at 2:58 PM Rahul Goswami
> wrote:
>
> > Nope. The underscore is preserved right after tokenization even before it
> > reaches any filters. You can choose the type "text_general" and try an
Nope. The underscore is preserved right after tokenization even before it
reaches any filters. You can choose the type "text_general" and try an
index time analysis through the "Analysis" page on Solr Admin UI.
Thanks,
Rahul
On Sat, Jan 9, 2021 at 8:22 AM xiefengchan
this behavior is included in
the documentation since it is similar to the behavior with periods.
https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer
"Periods (dots) that are not followed by whitespace are kept as part of the
token, including Internet domain names. "
Thanks,
Rahul
ation nevertheless.
https://backstage.forgerock.com/knowledge/kb/article/a39551500
The hex number the author talks about in the link above is the native
thread id.
Best,
Rahul
On Wed, Oct 14, 2020 at 8:00 AM Erick Erickson
wrote:
> Zisis makes good points. One other thing is I’d look to
>
updates.
Is this understanding correct ?
Thanks,
Rahul
On Wed, Oct 7, 2020 at 11:39 PM yaswanth kumar
wrote:
> Thank you very much both Eric and Shawn
>
> Sent from my iPhone
>
> > On Oct 7, 2020, at 10:41 PM, Shawn Heisey wrote:
> >
> > On 10/7/2020 4:40 PM, yaswant
l
3. How to scale up the servers for the better performance?
>> This is too open ended a question and depends on a lot of factors
specific to your environment and use-case :)
- Rahul
On Tue, Oct 6, 2020 at 4:26 PM Manisha Rahatadkar <
manisha.rahatad...@anjusoftware.com> wrote:
> Hi
Charlie,
Thanks for providing an alternate approach to doing this. It would be
interesting to know how one could go about organizing the docs in this
case? (Nested documents?) How would join queries perform on a large
index(200 million+ docs)?
Thanks,
Rahul
On Fri, Oct 2, 2020 at 5:55 AM
count-filter
You'll need to configure it in the schema for the "index" analyzer for the
data type of the field with large text.
Indexing documents of the order of half a GB will definitely come to hurt
your operations, if not now, later (think OOM, extremely slow atomic
updates, long r
Thanks for sharing this Anshum. Day 1 had some really interesting sessions.
Missed out on a couple that I would have liked to listen to. Are the
recordings of these sessions available anywhere?
-Rahul
On Mon, Sep 28, 2020 at 7:08 PM Anshum Gupta wrote:
> Hey everyone!
>
> ApacheCo
that I would still
expect delete by id to execute in reasonable time, so I would start by
looking at what is eating up the CPU in your request.
-Rahul
On Sat, Sep 26, 2020 at 4:50 AM Goutham Tholpadi
wrote:
> Thanks Dominique! I just tried deleting a single document using its id. I
>
Goutham,
Is the field you are trying to delete by indexed=true in the schema ?
If the uniqueKey is indexed=true, does delete by id work for you?
( uniqueKey:value)
Also, instead of "Solr Command" if you choose the Document type as "XML"
does it make any difference?
Rahul
On
rect me if
I am wrong!)
-Rahul
On Thu, Sep 17, 2020 at 2:56 PM Rajdeep Sahoo
wrote:
> If someone is searching with " tshirt tshirt tshirt tshirt tshirt tshirt"
> we need to remove the duplicates and search with tshirt.
>
>
> On Fri, 18 Sep, 2020, 12:19 am Alexandre Rafalo
I agree with Phill, Noble and Ilan above. The problematic term is "slave"
(not master) which I am all for changing if it causes less regression than
removing BOTH master and slave. Since some people have pointed out Github
changing the "master" terminology, in my personal opinion, it was not a
meas
+1 on avoiding SolrCloud terminology. In the interest of keeping it obvious
and simple, may I please suggest primary/secondary?
On Wed, Jun 17, 2020 at 5:14 PM Atita Arora wrote:
> I agree with avoiding SolrCloud terminology too.
>
> I may suggest going for "prime" and "clone"
> (Short an
) stored=false and docValues=true
3) stored=true and docValues=true
Thanks,
Rahul
On Tue, May 19, 2020 at 5:55 PM Erick Erickson
wrote:
> They are _absolutely_ able to be used together. Background:
>
> “In the bad old days”, there was no docValues. So whenever you needed
> to facet/so
Hoss,
Thank you for such a succinct explanation! I was not aware of the order of
lookups (queryResultCache followed by filterCache). Makes sense now. Sorry
for the false alarm!
Rahul
On Mon, Apr 20, 2020 at 4:04 PM Chris Hostetter
wrote:
> : 4) A query with different fq.
> :
"item_manu:samsung
manu:apple":"SortedIntDocSet{size=2,ramUsed=40 bytes}",
"warmupTime":0,
"maxRamMB":-1,
5) A query with the same fq again (fq=manu:samsung OR manu:apple): the
numbers don't get updated for this fq hereafter for subseque
ted.
However, if I search with the same fq again, I expect the lookup and hits
count to increase, but it doesn't. This ultimately results in an incorrect
hitratio.
I tried this scenario on Solr 7.2.1, 7.7.2 and 8.5 and observe the same
behavior on all three versions.
Is this a bug or am I missing something here?
Thanks,
Rahul
eb 13, 2020 at 9:26 AM Erick Erickson
wrote:
> That should be OK. There were no code changes necessary for that upgrade.
> see SOLR-13363
>
> > On Feb 12, 2020, at 5:34 PM, Rahul Goswami
> wrote:
> >
> > Hello,
> > We are running a SolrCloud (7.2.1) cluster an
updates requests for a 2 node SolrCloud cluster with
the older (3.4.10) zookeeper and it seemed to work fine. But just want to
know if there are any caveats I should be aware of.
Thanks,
Rahul
Hello,
I am working with Solr 7.2.1 and had a question regarding the performance
of wildcard searches.
q=*:*
vs
q=id:*
vs
q=id:[* TO *]
Can someone please rank them in the order of performance with the
underlying reason?
Thanks,
Rahul
l documents and the
index size (to gather stats about the Solr server), is the amount of memory
consumed proportional to the index size in some way?
Thanks,
Rahul
On Wed, Jan 29, 2020 at 6:43 PM Shawn Heisey wrote:
> On 1/29/2020 3:01 PM, Rahul Goswami wrote:
> > 1) How expensive is c
Thanks for your response Walter. But I could not find a Java api for Luke
for writing my tool. Is there one? I also tried using the LukeRequestHandler
that comes with Solr, but invoking it causes the Solr core to be loaded.
Rahul
On Wed, Jan 29, 2020 at 5:20 PM Walter Underwood
wrote:
>
production
setup with above configuration?
Thanks,
Rahul
for better
application design considerations.
Thanks,
Rahul
s.
Is it linked appropriately? Or is it some access rights issue for non-PMC
members like me ?
Thanks,
Rahul
On Wed, Dec 4, 2019 at 7:12 AM Noble Paul wrote:
> Thanks ishan
>
> On Wed, Dec 4, 2019, 3:32 PM Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com>
> wrote:
>
Hi Sujatha,
How did you upgrade your cluster ? Did you restart each node in the cluster
one by one after upgrade (while other nodes were running on 6.6.2) or did
you bring down the entire cluster and bring up one upgraded node at a time?
Thanks,
Rahul
On Thu, Nov 14, 2019 at 7:03 AM Paras
Hello,
Just wanted to follow up in case my question fell through the cracks :)
Would appreciate help on this.
Thanks,
Rahul
On Fri, Nov 15, 2019 at 5:32 PM Rahul Goswami wrote:
> Hello,
>
> We are planning to upgrade our SolrCloud cluster from 7.2.1 (hosted on
> Windows server)
n that case?
Thanks in advance!
Regards,
Rahul
any further custom processors other than the run update processor in
standalone mode? Alternatively, is there a way I can get a handle on a
complete document once it’s reconstructed from an atomic update?
Thanks,
Rahul
On Thu, Sep 19, 2019 at 7:06 AM Erick Erickson
wrote:
> _Why_ is reindex
the
processAdd() of the processor. Is this an expected behavior?
Regards,
Rahul
On Wed, Sep 18, 2019 at 5:28 PM Erick Erickson
wrote:
> It Depends (tm). This is a little confused. Why do you have
> distributed processor in stand-alone Solr? Stand-alone doesn't, well,
> distrib
don’t see any log lines from the processAdd() method.
Any inputs on why the processor is getting skipped if placed after
distributed processor?
Thanks,
Rahul
I am using Solr version 6.6.0 and the heap size is set to 512 MB, which I believe
is the default. We have almost 10 million documents in the index, and we
perform frequent updates to the index (we do a soft commit on every update; the heap issue
was seen with and without soft commits), and obviousl
y one huge
document ?
2) If yes, does this flush create a segment with just one document ?
3) Heap dump analysis shows large (>350 MB) instances of
DocumentsWriterPerThread. Does one instance of this class correspond to one
document?
Help is much appreciated.
Thanks,
Rahul
On Fri, Jul 5, 20
Shawn, Erick,
Thank you for the explanation. The merge scheduler params make sense now.
Thanks,
Rahul
On Wed, Jul 3, 2019 at 11:30 AM Erick Erickson
wrote:
> Two more tidbits to add to Shawn’s explanation:
>
> There are heuristics built in to ConcurrentMergeScheduler.
> From
iculty wrapping my head around this, and would appreciate if you could
help clear it for me.
Thanks,
Rahul
On Thu, Jun 13, 2019 at 7:33 AM Shawn Heisey wrote:
> On 6/6/2019 9:00 AM, Rahul Goswami wrote:
> > *OP Reply* : Total 48 GB per node... I couldn't see another software
> us
beefy physical servers at disposal for
this deployment. If we go with 4 SolrClouds then we would have 4x8=32 nodes
(Solr instances) running across these 4 physical servers.
Any issues that you might see with this configuration or additional
considerations that I might be missing?
Thanks,
Rahul
efficient for our use case
considering moderate-heavy indexing and search load? Would also like to
know the tradeoffs involved if any. Thanks in advance!
Regards,
Rahul
r this part is different on the master.
Regards,
Rahul
On Thu, Jun 20, 2019 at 8:22 PM Rahul Goswami wrote:
> Hi Gus,
> Thanks for the response and referencing the umbrella JIRA for these kinds
> of issues. I see that it won't solve the problem since the builder object
> wh
binary to try
the patch nevertheless, but it didn't help as I anticipated. I'll update
the JIRA and submit a patch.
Thank you,
Rahul
On Thu, Jun 20, 2019 at 11:35 AM Gus Heck wrote:
> Hi Rahul,
>
> Did you try the patch in that issue? Also food for thought:
> https://is
teShardHandlerConfig().getDistributedSocketTimeout();
}
I found this open JIRA on this issue:
https://issues.apache.org/jira/browse/SOLR-12550?jql=text%20~%20%22distribUpdateSoTimeout%22
Should I update the JIRA with this ?
Thanks,
Rahul
On Thu, Jun 13, 2019 at 12:00 AM Rahul Goswami
wrote:
> Hello,
>
, is there a JIRA for it ?
Thanks,
Rahul
/measures.
Thanks,
Rahul
On Thu, Jun 6, 2019 at 11:00 AM Rahul Goswami wrote:
> Thank you for your responses. Please find additional details about the
> setup below:
>
> We are using Solr 7.2.1
>
> > I have a solrcloud setup on Windows server with below config:
> >
ndex.ConcurrentMergeScheduler",
"maxMergeCount":2,
"maxThreadCount":2},
Thanks,
Rahul
On Wed, Jun 5, 2019 at 4:24 PM Shawn Heisey wrote:
> On 6/5/2019 9:39 AM, Rahul Goswami wrote:
> > I have a solrcloud setup on Windows server with below config:
> >
that
this is the cause, and the timeouts and recoveries are the symptoms. Is my
understanding correct? If yes, what steps could I take to help the
situation? I do see that the difference between "Num Docs" and "Max Docs"
is about 20%.
Would appreciate your help.
Thanks,
Rahul
, since the parameters of this fq don't
change shouldn't I expect to gain any advantage out of using the
filterCache?
Thanks,
Rahul
On Wed, May 22, 2019 at 7:40 AM Toke Eskildsen wrote:
> On Wed, 2019-05-15 at 21:37 -0400, Rahul Goswami wrote:
> > fq={!graph from=from_field to=
on in Solr log files. I am thinking that seeing errors in the log files
doesn't hurt as long as the updates and gets work fine, but I would still like
to know how to prevent these errors from happening.
Thanks
Rahul Mandava
Hello experts,
Just following up in case my previous email got lost in the big stack of
queries. Would appreciate any help on optimizing a graph query. Or any
pointers on the direction to investigate.
Thanks,
Rahul
On Wed, May 15, 2019 at 9:37 PM Rahul Goswami wrote:
> Hello,
>
optimizations that I
could try?
Thanks,
Rahul
'll continue to monitor this for now.
Thanks,
Rahul
On Fri, Mar 8, 2019 at 2:14 PM Erick Erickson
wrote:
> (1) no, and Shawn’s comments are well taken.
>
> (2) bq. is the number of segments would drastically increase
>
> Not true. First of all, TieredMergePolicy will take care of m
autoCommit interval (with openSearcher=false) is the number of
segments that would drastically increase, eventually causing merges, slower
searches, etc.
Thanks,
Rahul
On Fri, Mar 8, 2019 at 12:08 PM Erick Erickson
wrote:
> Yes, you’ll get stale values. There’s no way I know of to change that,
>
1
On Thu, Mar 7, 2019 at 11:36 PM Zheng Lin Edwin Yeo
wrote:
> Hi,
>
> Do you mean that when you startup Solr, it will automatically do the search
> request even before the Solr is fully started up?
>
> Regards,
> Edwin
>
>
> On Fri, 8 Mar 2019 at 10:13, Rahul Goswami
results, which in
turn has a cascading effect on other parts of the application. Is there a
setting in Solr which would prevent Solr from serving search requests
before log replay has finished?
Thanks,
Rahul
in Solr
to know whether a replica is falling behind the leader?
Thanks,
Rahul
On Mon, Feb 11, 2019 at 10:28 PM Erick Erickson
wrote:
> bq. To answer your question about index size on
> disk, it is 3 TB on every node. As mentioned it's a 32 GB machine and I
> allocated 24G
our currentUpdates
Regards,
Rahul
On Thu, Feb 7, 2019 at 12:59 PM Erick Erickson
wrote:
> bq. We have a heavy indexing load of about 10,000 documents every 150
> seconds.
> Not so heavy query load.
>
> It's unlikely that changing numRecordsToKeep will help all that much if
> y
47C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46]
org.apache.solr.update.PeerSync PeerSync:
core=DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46 url=
http://indexnode1:2/solr too many updates received since start -
startingUpdates no longer overlaps with our currentUpdates
Thanks,
Rahul
created post split?
Regards,
Rahul
On Wed, Jan 30, 2019 at 1:18 AM Rahul Goswami wrote:
> Thanks for the reply Jan. I have been referring to documentation for
> SPLITSHARD on 7.2.1
> <https://lucene.apache.org/solr/guide/7_2/collections-api.html#splitshard>
> which
> see
sc",fl="field1,field2,field3",qt="/export",q="*:*",fq="((field4:1)
OR (field4:2))",fq="{!collapse field=id_field sort='field3 desc'}")
The same query with "select" handler does return the collapse result fine.
Looks like this m
ink you need a
> screenshot here, what you describe is the default behaviour.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 28. jan. 2019 kl. 09:05 skrev Rahul Goswami :
> >
> > Hello,
> > I am using Solr 7.2.1. I c
Thanks,
Rahul
ve is coming from
documents not present in the same shard. I'll verify this tomorrow and
update the thread.
Thanks,
Rahul
On Mon, Jan 21, 2019 at 2:26 PM Joel Bernstein wrote:
> I haven't had time to look into the details of this issue but it's not
> clear that these two fea
Hello,
Following up on my query. I know this might be too specific an issue. But I
just want to know that it's a legitimate bug and the supported operation is
allowed with the /export handler. If someone has an idea about this and
could confirm, that would be great.
Thanks,
Rahul
On Thu, J
Hello,
I am using SolrCloud on Solr 7.2.1.
I get the NullPointerException in the Solr logs (in ExportWriter.java) when
the /stream handler is invoked with a search() streaming expression with
qt="/export" containing fq="{!collapse field=id_field sort='time desc'}"
(among other fq's. I tried elimina
particularly
functional for any industry size load anyway.
Thanks,
Rahul
On Tue, Nov 20, 2018 at 3:37 AM Toke Eskildsen wrote:
> On Mon, 2018-11-19 at 22:19 -0500, Rahul Goswami wrote:
> > I am using SolrCloud 7.2.1. My understanding is that setting
> > docvalues=true would optimize fac
What is the router name for your collection? Is it "implicit"? (You can
check this in the "Overview" of your collection in the admin UI.) If yes,
what is the router.field parameter the collection was created with?
Rahul
On Mon, Nov 19, 2018 at 11:19 PM Rajeswari Koll
What’s your update query?
You need to provide the unique id field of the document you are updating.
Rahul
On Mon, Nov 19, 2018 at 10:58 PM Rajeswari Kolluri <
rajeswari.koll...@oracle.com> wrote:
> Hi,
>
>
>
>
>
> Using Solr 7.5.0. While performing atomic upd
I am using SolrCloud 7.2.1. My understanding is that setting docValues=true
would optimize faceting, grouping and sorting, but for a field to be
searchable it needs to be indexed=true. However, I was dumbfounded today
when I executed a successful search on a field with the configuration below:
However
https://github.com/bazaarvoice/jolt
On Thu, Sep 13, 2018 at 9:18 AM Joel Bernstein wrote:
> Solr Streaming Expressions allow you to do this with the cartesianProduct
> function:
>
>
> http://lucene.apache.org/solr/guide/7_4/stream-decorator-reference.html#cartesianproduct
>
> The structure of th
Depends on whether you are using standalone Solr or SolrCloud. SolrCloud distributes data
into shards, so it increases overall capacity.
Rahul Singh
Chief Executive Officer
m 202.905.2818
Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007
We build and manage digital business
waste of space.
Rahul Singh
Chief Executive Officer
m 202.905.2818
Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007
We build and manage digital business technology platforms.
On Sep 11, 2018, 11:23 PM -0400, John Smith , wrote:
> On Tue, Sep 11, 2018 at 11:05 PM Wal
” query.
Rahul Singh
Chief Executive Officer
m 202.905.2818
Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007
We build and manage digital business technology platforms.
On Sep 3, 2018, 6:29 AM -0400, Emir Arnautović ,
wrote:
> Hi,
> The requirement is not 100% cl
I wrote something related to this topic a while ago.
https://www.google.com/amp/s/blog.anant.us/resources-for-monitoring-datastax-cassandra-spark-solr-performance/amp/
Rahul
On Aug 16, 2018, 3:35 PM -0700, Jan Høydahl , wrote:
> Check out the Reference Guide chapter on monitoring with o
with leader and replicas being spread around the cluster.
You would be bypassing general High availability / distributed computing
processes by trying to not reindex.
Rahul
On Aug 7, 2018, 7:06 AM -0400, Bjarke Buur Mortensen ,
wrote:
> Hi List,
>
> is there a cookbook recipe for
the _default configset for any collections created without
explicit configset.
Regards,
Rahul Chhiber
-Original Message-
From: Chuming Chen [mailto:chumingc...@gmail.com]
Sent: Thursday, July 26, 2018 11:35 PM
To: solr-user@lucene.apache.org
Subject: create collection from existing
Their commercial offering still has something like it. You can always try
Grafana
Rahul
On Jul 13, 2018, 9:59 AM -0400, rgummadi , wrote:
> Is SiLK from LucidWorks still an active project? I looked at their GitHub and
> it does not seem to be active. If so are there any alternative sol
deduplication — the join I’m pretty sure works
on exact matches.
Consider creating an “identity” collection where you map the different names to
a unique identity key. This could then technically be joined on two datasets,
and then those could be joined again.
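A minimal sketch of the identity-key idea, in plain Python rather than Solr: different spellings of a name are mapped to one key, and two datasets are then matched on that key instead of on the raw name. The field name `name`, the normalization rule and the sample keys are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace (an assumed, deliberately simple rule)."""
    return " ".join(name.lower().split())


def build_identity_lookup(aliases_by_identity: Dict[str, Iterable[str]]) -> Dict[str, str]:
    """Map every known alias to its identity key."""
    lookup: Dict[str, str] = {}
    for identity, aliases in aliases_by_identity.items():
        for alias in aliases:
            lookup[normalize(alias)] = identity
    return lookup


def join_on_identity(
    left: List[dict], right: List[dict], lookup: Dict[str, str], name_field: str = "name"
) -> List[Tuple[dict, dict]]:
    """Pair records from both datasets that resolve to the same identity key."""
    right_by_identity: Dict[str, List[dict]] = {}
    for record in right:
        identity = lookup.get(normalize(record[name_field]))
        if identity is not None:
            right_by_identity.setdefault(identity, []).append(record)

    pairs: List[Tuple[dict, dict]] = []
    for record in left:
        identity = lookup.get(normalize(record[name_field]))
        for match in right_by_identity.get(identity, []):
            pairs.append((record, match))
    return pairs
```

Records whose name is not in the lookup are simply left unmatched, which makes gaps in the identity mapping easy to spot.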
Rahul
On Jul 11, 2018, 4:42 PM -0400, Aroop
/solr/gettingstarted/select?q='*
<http://localhost:8983/solr/gettingstarted/select?q='*>'*
Please suggest anything that might help, and let me know if I am missing something.
Thanks,
Rahul
Agreed. DIH is not an industrial-grade ETL tool; you may want to consider other
options. Kafka Connect is one alternative to look into: it has
connectors for JDBC into Kafka, and from Kafka into Solr.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Jul 9, 2018, 6:14 AM -0500
Use -v option in the bin/solr start command.
Regards,
Rahul Chhiber
-Original Message-
From: Prateek Jain J [mailto:prateek.j.j...@ericsson.com]
Sent: Monday, July 09, 2018 4:26 PM
To: solr-user@lucene.apache.org
Subject: cmd to enable debug logs
Hi All,
What's the command (fro
Have you tried changing the log level?
https://lucene.apache.org/solr/guide/7_2/configuring-logging.html
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Jul 8, 2018, 8:54 PM -0500, Yasufumi Mizoguchi ,
wrote:
> Hi,
>
> I am trying to index files into Solr 7.2 using da
is a work in progress and I'll update this with screenshots as well as
with links from other contributors.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
If it’s Windows, it may be using a tool called NSSM to manage the Solr service.
Look at Windows services and the task scheduler to understand whether the Solr services are
being managed by Windows via services or the task scheduler, or just by batch
files.
Rahul
On Jun 20, 2018, 11:34 AM -0400, Shawn Heisey
are some decent distributed shared file system services that could be
leveraged depending on the number of compute nodes.
A shared file system is the best way to keep it consistent but it comes with its
drawbacks. You can always back up locally and asynchronously sync to the shared FS
too.
--
Rahul
Right,
That’s why you need a place to persist the task list / graph. If you use a
table, you can set a “processed” / “unprocessed” value; if you use a queue, then it's
delivered only once. Otherwise you have to check the indexed date from Solr, and
waste a Solr call.
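As a rough illustration of the table option, here is a sketch that keeps the task list in SQLite with a processed flag. The table name, column names and helper functions are all hypothetical; any database with an equivalent flag column would serve the same purpose.

```python
import sqlite3
from typing import Iterable, List


def init_tasks(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) task table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "path TEXT PRIMARY KEY, processed INTEGER NOT NULL DEFAULT 0)"
    )


def add_tasks(conn: sqlite3.Connection, paths: Iterable[str]) -> None:
    """Register work items; items that are already known are left untouched."""
    conn.executemany(
        "INSERT OR IGNORE INTO tasks (path) VALUES (?)", [(p,) for p in paths]
    )
    conn.commit()


def unprocessed(conn: sqlite3.Connection, limit: int = 100) -> List[str]:
    """Return the next batch of items that still need to be indexed."""
    rows = conn.execute(
        "SELECT path FROM tasks WHERE processed = 0 ORDER BY path LIMIT ?", (limit,)
    )
    return [row[0] for row in rows]


def mark_processed(conn: sqlite3.Connection, path: str) -> None:
    """Flag an item as done so it is not picked up again."""
    conn.execute("UPDATE tasks SET processed = 1 WHERE path = ?", (path,))
    conn.commit()
```

With this in place the indexer only asks the table what is left to do, so no Solr query is needed to find out whether a document was already indexed.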
--
Rahul Singh
rahul.si...@anant.us
.
http://saumitra.me/blog/tweet-search-and-analysis-with-kafka-solr-cassandra/
I don't know where this guy's code went, but the content is there with code
samples.
--
On May 23, 2018, 8:37 PM -0500, Raymond Xie , wrote:
> Thank you Rahul despite that's very high level.
>
> With
Enumerate the file locations (map), put them in a queue like RabbitMQ or Kafka
(persist the map), and have a bunch of threads, workers, containers, or whatever pop
items off the queue and process them (reduce).
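The enumerate / queue / workers pattern can be sketched with the Python standard library. Note the swap: an in-memory `queue.Queue` stands in for RabbitMQ or Kafka here, so nothing is persisted, and `handle` is a hypothetical callback representing whatever parses a file and sends it to Solr.

```python
import queue
import threading
from typing import Callable, Iterable, List


def process_all(
    items: Iterable[str], handle: Callable[[str], str], num_workers: int = 4
) -> List[str]:
    """Fill a queue with work items and let worker threads drain it."""
    tasks: "queue.Queue[str]" = queue.Queue()
    for item in items:  # "map": enumerate the work up front
        tasks.put(item)

    results: List[str] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                item = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                outcome = handle(item)  # "reduce": process one item
                with lock:
                    results.append(outcome)
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Results arrive in whatever order the workers finish, so callers should not rely on input order.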
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On May 20, 2018, 7:24 AM -0400
You can try to leverage Spark to index, or Kafka Connect with Solr.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On May 14, 2018, 2:03 AM -0500, Mikhail Khludnev , wrote:
> A few years ago I provided server side concurrency "booster"
> https://issues.apache.org/jira/browse/
Having concurrent DIH for example from the same source on different cluster
nodes may cause duplicate work. But yes the ZK is what distributes the conf.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On May 16, 2018, 4:55 AM -0500, Jon Morisi , wrote:
> Hi All,
> I'm
.
4. Unless you need highlighting, only index the actual contents, and store the
rest of the fields.
5. Shared file storage is probably OK, but you may want to go with a caching
layer via Nginx and serve files through it. That way you don’t hit the disk
every time.
--
Rahul Singh
rahul.si
pipeline.
Best,
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Apr 29, 2018, 6:27 AM -0700, Doug Turnbull
, wrote:
> Morphlines is a cloudera specific tool. I suspect moving Solr platforms
> will require you to rework your indexing somewhat. You may need to step
> back and think
process can improve the overall stability of the SolR service.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Apr 25, 2018, 12:49 PM -0400, Shawn Heisey , wrote:
> On 4/25/2018 4:02 AM, Lee Carroll wrote:
> > *We don't recommend using solr-cell for production indexing.*
>
CSV -> Spark -> SolR
https://github.com/lucidworks/spark-solr/blob/master/docs/examples/csv.adoc
If speed is not an issue there are other methods. Spring Batch / Spring Data
might have all the tools you need to get speed without Spark.
--
Rahul Singh
rahul.si...@anant.us
Anant Corpo
If you want speed, Spark is the fastest, easiest way. You can connect to
relational tables directly and import, or export to CSV / JSON and import from a
distributed filesystem like S3 or HDFS.
By combining a DFS with Spark and a highly available Solr, you are maximizing all
threads.
--
Rahul
How much data and what is the database source? Spark is probably the fastest
way.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar , wrote:
> Hi,
>
> We are using DIH with SortedMapBackedCache but as data size increases we
> nee
You may need to extract outside Solr and index pure text with an external ingestion
process. That way you have much more control over the Tika attributes and behaviors.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Apr 9, 2018, 10:23 PM -0400, Zheng Lin Edwin Yeo ,
wrote:
> Hi,
>
> Cu
You may be overthinking this. There is a “more like this” feature that basically does
this. Give that a try before digging deeper into the LTR methods. It may be
good enough for rock and roll.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Mar 28, 2018, 12:25 PM -0400, Xavier Schepler
because the
updates / selects are fast.
Ultimately I think Solr is like an 18-wheel tractor trailer and Elastic is like
a U-Haul truck: you can chain a bunch of them up to do what Solr does.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Mar 22, 2018, 9:04 AM -0500, Liu, Daphne
Parallel processing in any way will help, including Spark w/ a DFS like S3 or
HDFS. Your three machines could end up being a bottleneck and you may need more
nodes.
On Mar 20, 2018, 2:36 AM -0500, LOPEZ-CORTES Mariano-ext
, wrote:
> CSV file is 5GB aprox. for 29 millions.
>
> As you say Christo
Use a proxy server that only gives access to the update / select handlers
(URLs). You can do it in any of numerous programming languages or with a simple proxy
in nginx.
The whole web server running Solr is not supposed to be out in the open. You
are opening yourself up to too many issues.
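The heart of such a proxy is the allow-list check, which might look like the sketch below. It assumes request paths of the form `/solr/<collection>/<handler>` and a conservative pattern for collection names; both are assumptions to adapt, and the actual forwarding to Solr is left out.

```python
import re
from urllib.parse import urlsplit

# Only these handlers are exposed; everything else (admin, config, ...) is refused.
ALLOWED_HANDLERS = ("select", "update")

# Assumed naming rule for collections; tighten or relax it for your deployment.
COLLECTION_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_allowed(url_or_path: str) -> bool:
    """Return True only for /solr/<collection>/select or /solr/<collection>/update."""
    path = urlsplit(url_or_path).path
    parts = [part for part in path.split("/") if part]
    if len(parts) != 3 or parts[0] != "solr":
        return False
    collection, handler = parts[1], parts[2]
    if COLLECTION_PATTERN.fullmatch(collection) is None:
        return False
    return handler in ALLOWED_HANDLERS
```

The check is intentionally strict: sub-paths such as `/update/json/docs` are rejected until they are added explicitly.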
--
Rahul