https://github.com/bazaarvoice/jolt
On Thu, Sep 13, 2018 at 9:18 AM Joel Bernstein wrote:
> Solr Streaming Expressions allow you to do this with the cartesianProduct
> function:
>
>
> http://lucene.apache.org/solr/guide/7_4/stream-decorator-reference.html#cartesianproduct
>
> The structure of th
Depends on whether you are using standalone Solr or SolrCloud. SolrCloud
distributes data into shards, so it increases overall capacity.
Rahul Singh
Chief Executive Officer
m 202.905.2818
Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007
We build and manage digital business
waste of space.
Rahul Singh
Chief Executive Officer
m 202.905.2818
Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007
We build and manage digital business technology platforms.
On Sep 11, 2018, 11:23 PM -0400, John Smith , wrote:
> On Tue, Sep 11, 2018 at 11:05 PM Wal
” query.
Rahul Singh
Chief Executive Officer
m 202.905.2818
Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007
We build and manage digital business technology platforms.
On Sep 3, 2018, 6:29 AM -0400, Emir Arnautović ,
wrote:
> Hi,
> The requirement is not 100% cl
I wrote something related to this topic a while ago.
https://www.google.com/amp/s/blog.anant.us/resources-for-monitoring-datastax-cassandra-spark-solr-performance/amp/
Rahul
On Aug 16, 2018, 3:35 PM -0700, Jan Høydahl , wrote:
> Check out the Reference Guide chapter on monitoring with open source
Bjarke,
I am imagining that at some point you may need to shard that data if it grows.
Or do you imagine this data to remain stagnant?
Generally you want to add SolrCloud to do three things: 1. Increase availability
with replicas 2. Increase available data via shards 3. Increase fault tolerance
Their commercial offering still has something like it. You can always try
Grafana
Rahul
On Jul 13, 2018, 9:59 AM -0400, rgummadi , wrote:
> Is SiLK from LucidWorks still an active project? I looked at their GitHub and
> it does not seem to be active. If not, are there any alternative solutions?
>
>
How do you define similarity? There are various methods, and each works for
different scenarios. In Solr, depending on which index-time analyzer / tokenizer
you are using, it will treat one company name as similar in one scenario and
not in another.
This seems like a case of data deduplication
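
To illustrate how the choice of analysis changes what counts as "similar", here is a minimal sketch in plain Python. It is not Solr's analyzer chain: the two tokenizers, the suffix list, and the company names are all hypothetical stand-ins chosen for the example.

```python
import re

def whitespace_tokens(name):
    # Splits on whitespace only; case and punctuation are preserved.
    return set(name.split())

def normalized_tokens(name):
    # Lowercases, strips punctuation, and drops a few legal suffixes.
    # The suffix list is an illustrative assumption, not a standard.
    suffixes = {"inc", "corp", "corporation", "llc", "ltd"}
    words = re.findall(r"[a-z0-9]+", name.lower())
    return {w for w in words if w not in suffixes}

def jaccard(a, b):
    # Overlap of two token sets: size of intersection over size of union.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical company names that a human would call the same entity.
first, second = "Acme Corporation", "ACME Corp."
strict = jaccard(whitespace_tokens(first), whitespace_tokens(second))
loose = jaccard(normalized_tokens(first), normalized_tokens(second))
```

With whitespace-only tokens the two names share nothing, while the normalized tokens match completely, which is the kind of difference an analyzer choice can make.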
Agreed. DIH is not an industrial-grade ETL tool, so you may want to consider
other options. Kafka Connect is one alternative to look into: it has
connectors for JDBC into Kafka, and from Kafka into Solr.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Jul 9, 2018, 6:14 AM -0500
Have you tried changing the log level
https://lucene.apache.org/solr/guide/7_2/configuring-logging.html
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Jul 8, 2018, 8:54 PM -0500, Yasufumi Mizoguchi ,
wrote:
> Hi,
>
> I am trying to indexing files into Solr 7.2 using da
is a work in progress and I'll update this with screenshots as well as
with links from other contributors.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
If it’s Windows, it may be using a tool called NSSM to manage the Solr service.
Look at Windows Services and Task Scheduler to understand whether the Solr
services are being managed by Windows via services or the task scheduler, or
just by batch files.
Rahul
On Jun 20, 2018, 11:34 AM -0400, Shawn Heisey
are some decent distributed shared file system services that could be
leveraged depending on the number of compute nodes.
A shared file system is the best way to keep it consistent, but it comes with
its drawbacks. You can always back up locally and asynchronously sync to the
shared FS too.
--
Rahul
Right,
That’s why you need a place to persist the task list / graph. If you use a
table, you can set a “processed” / “unprocessed” value. If you use a queue,
each item is delivered only once. Otherwise you have to check the indexed date
from Solr, and waste a Solr call.
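
A minimal sketch of the table option, assuming SQLite as the store. The table name, column names, and file paths are hypothetical; any database with an equivalent status column would serve.

```python
import sqlite3

def create_task_table(conn):
    # One row per file; status starts as 'unprocessed'.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "path TEXT PRIMARY KEY, "
        "status TEXT NOT NULL DEFAULT 'unprocessed')"
    )

def add_tasks(conn, paths):
    # INSERT OR IGNORE keeps a re-enumeration from resetting finished work.
    conn.executemany(
        "INSERT OR IGNORE INTO tasks (path) VALUES (?)",
        [(p,) for p in paths],
    )
    conn.commit()

def pending(conn):
    # Everything that still needs to be indexed.
    rows = conn.execute(
        "SELECT path FROM tasks WHERE status = 'unprocessed' ORDER BY path"
    )
    return [row[0] for row in rows]

def mark_processed(conn, path):
    conn.execute(
        "UPDATE tasks SET status = 'processed' WHERE path = ?", (path,)
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
create_task_table(conn)
add_tasks(conn, ["/data/a.pdf", "/data/b.pdf"])
mark_processed(conn, "/data/a.pdf")
add_tasks(conn, ["/data/a.pdf", "/data/c.pdf"])  # a.pdf stays processed
remaining = pending(conn)
```

Because the status lives in the table, a restarted job can ask for the pending rows directly instead of querying Solr for what was already indexed.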
--
Rahul Singh
rahul.si...@anant.us
aring more from you or anyone in this
> Solr community.
>
>
>
> Sincerely yours,
>
>
> Raymond
>
> > On Wed, May 23, 2018 at 8:15 AM, Rahul Singh
> > wrote:
> > > Enumerate the file locations (map
Enumerate the file locations (map), put them in a queue like RabbitMQ or Kafka
(persist the map), and have a bunch of threads, workers, containers, or whatever
pop off the queue and process each item (reduce).
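
The pattern above can be sketched as follows. This is a single-process illustration only: Python's in-memory `queue.Queue` stands in for RabbitMQ or Kafka (so nothing is actually persisted), and `index_item` is a hypothetical placeholder for the real extract-and-index step.

```python
import queue
import threading

def index_item(path):
    # Placeholder for the real work (extract text, send the document to Solr).
    return path.upper()

def run_workers(paths, worker_count=4):
    # "Map": enumerate the work items and put them on the queue.
    tasks = queue.Queue()
    for path in paths:
        tasks.put(path)

    results = []
    lock = threading.Lock()

    def worker():
        # "Reduce": each worker pops items until the queue is empty.
        while True:
            try:
                path = tasks.get_nowait()
            except queue.Empty:
                return
            output = index_item(path)
            with lock:
                results.append(output)
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Each item is taken off the queue by exactly one worker, so adding workers increases throughput without duplicating work.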
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On May 20, 2018, 7:24 AM -0400
Can try to leverage Spark to index. Or Kafka Connect with SolR.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On May 14, 2018, 2:03 AM -0500, Mikhail Khludnev , wrote:
> A few years ago I provided server side concurrency "booster"
> https://issues.apache.org/jira/browse/
Having concurrent DIH for example from the same source on different cluster
nodes may cause duplicate work. But yes the ZK is what distributes the conf.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On May 16, 2018, 4:55 AM -0500, Jon Morisi , wrote:
> Hi All,
> I'm
.
4. Unless you need highlighting, only index the actual contents, and store the
rest of the fields.
5. Shared file storage is probably OK, but you may want to front it with a
caching layer via Nginx and serve files through it. That way you don’t hit the
disk every time.
--
Rahul Singh
rahul.si
pipeline.
Best,
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Apr 29, 2018, 6:27 AM -0700, Doug Turnbull
, wrote:
> Morphlines is a cloudera specific tool. I suspect moving Solr platforms
> will require you to rework your indexing somewhat. You may need to step
> back and think
process can improve the overall stability of the SolR service.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Apr 25, 2018, 12:49 PM -0400, Shawn Heisey , wrote:
> On 4/25/2018 4:02 AM, Lee Carroll wrote:
> > *We don't recommend using solr-cell for production indexing.*
>
CSV -> Spark -> SolR
https://github.com/lucidworks/spark-solr/blob/master/docs/examples/csv.adoc
If speed is not an issue there are other methods. Spring Batch / Spring Data
might have all the tools you need to get speed without Spark.
--
Rahul Singh
rahul.si...@anant.us
Anant Corpo
t by merging distinct RDBMS tables in using RDD?
>
> On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh wrote:
>
> > How much data and what is the database source? Spark is probably the
> > fastest way.
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
How much data and what is the database source? Spark is probably the fastest
way.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar , wrote:
> Hi,
>
> We are using DIH with SortedMapBackedCache but as data size increases we
> nee
May need to extract outside SolR and index pure text with an external ingestion
process. You have much more control over the Tika attributes and behaviors.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Apr 9, 2018, 10:23 PM -0400, Zheng Lin Edwin Yeo ,
wrote:
> Hi,
>
> Cu
Maybe overthinking this. There is a “more like this” feature that basically
does this. Give that a try before digging deeper into the LTR methods. It may
be good enough for rock and roll.
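
As a rough sketch of trying it out, the helper below builds a request URL that turns on the MoreLikeThis search component with the `mlt`, `mlt.fl`, and `mlt.count` parameters. The host, collection name, document query, and field names are assumptions for illustration, and the helper function itself is hypothetical.

```python
from urllib.parse import urlencode

def more_like_this_url(base_url, collection, doc_query, fields, count=5):
    # mlt=true enables the MoreLikeThis component on a normal select request;
    # mlt.fl lists the fields used to judge similarity.
    params = {
        "q": doc_query,
        "mlt": "true",
        "mlt.fl": ",".join(fields),
        "mlt.count": count,
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"

# Hypothetical host, collection, and fields.
url = more_like_this_url(
    "http://localhost:8983/solr", "docs", "id:42", ["title", "body"]
)
```

The fields named in `mlt.fl` generally need to be stored or have term vectors for the component to find useful terms.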
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Mar 28, 2018, 12:25 PM -0400, Xavier Schepler
because the
updates / selects are fast.
Ultimately I think SolR is like an 18-wheel tractor trailer and Elastic is like
U-Haul trucks: you can chain a bunch of them up to do what SolR does.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Mar 22, 2018, 9:04 AM -0500, Liu, Daphne
Parallel processing in any way will help, including Spark w/ a DFS like S3 or
HDFS. Your three machines could end up being a bottleneck and you may need more
nodes.
On Mar 20, 2018, 2:36 AM -0500, LOPEZ-CORTES Mariano-ext
, wrote:
> CSV file is approx. 5GB for 29 million records.
>
> As you say Christo
Use a proxy server that only gives access to the update / select handlers
(URLs). You can do it with numerous programming languages or with a simple
proxy in nginx.
The whole web server running SolR is not supposed to be out in the open. You
are opening yourself up to too many issues.
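
A minimal sketch of the allow-list check such a proxy would apply before forwarding a request. The URL layout (`/solr/<collection>/select` and `/solr/<collection>/update`) is an assumption about the deployment, and the function name is hypothetical; a real proxy would also need authentication and method checks.

```python
import re

# Only the select and update handlers of a collection are reachable.
ALLOWED = re.compile(r"^/solr/[A-Za-z0-9_-]+/(select|update)$")

def is_allowed(request_path):
    # The query string does not affect the decision; only the path does.
    path = request_path.split("?", 1)[0]
    return bool(ALLOWED.match(path))
```

Anything that does not match, such as admin endpoints, would be rejected by the proxy instead of being passed through to Solr.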
--
Rahul
may be more work but it’s more
scalable. Go big or go home. ;)
Hope it helps
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Mar 18, 2018, 11:14 AM -0400, Steven White , wrote:
> Hi everyone,
>
> I have a design problem that I'm not sure how to solve best so I figured I
response inline.
On Thu, May 7, 2015 at 7:01 PM, Shawn Heisey wrote:
> On 5/7/2015 3:43 AM, Rahul Singh wrote:
> > I have tried to deploy solr.war from building it from 4.7.2 but it is
> > showing the below mentioned error. Has anyone faced the same? any lead
> > woul
Hi,
I tried to deploy a solr.war that I built from 4.7.2, but it is showing the
error below. Has anyone faced the same? Any leads would be appreciated.
Error Message:
{
"responseHeader": {
"status": 500,
"QTime": 33
},
"error": {
"msg": "parsing error",
One of the measurement criteria is DCG (discounted cumulative gain).
http://en.wikipedia.org/wiki/Discounted_cumulative_gain
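
For reference, a minimal sketch of the metric, using the common form in which the graded relevance at rank i is divided by log2(i + 1). The relevance grades in the usage note are made-up example judgments, and the normalized variant here ranks only the judged results to form the ideal ordering.

```python
import math

def dcg(relevances):
    # Sum of rel_i / log2(i + 1) over 1-based ranks i.
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )

def ndcg(relevances):
    # Normalize by the DCG of the same grades in the best possible order.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

For example, `dcg([3, 2, 3, 0, 1, 2])` scores a six-result list with those grades, and `ndcg` rescales the score to the range 0 to 1 so that different queries can be compared.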
On Tue, Apr 1, 2014 at 11:44 AM, Floyd Wu wrote:
> Usually an IR system is measured using Precision & Recall.
> But it depends on what kind of system you are developing and what scenario it needs to fit.
>
> Take a lo