Thank you Upayavira,
I think I have done all these things using SolrJ, which was useful before
starting development of the project.
I hope I will not run into any issues using SolrJ, and I have gotten a lot
out of it.
Thanks
Mugeesh Husain
@Mikhail Regarding use of the data import handler: if I define my baseDir as
D:/work/folder, will it also work for sub-folders, sub-folders of
sub-folders, and so on?
Post your docs in sets of 1000. Create a:
List<SolrInputDocument> docs
Then add 1000 docs to it, then client.add(docs);
Repeat until your 40m are indexed.
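Putting that together, a rough, untested sketch of the batching loop (the
core URL and field names are placeholders, not taken from this thread):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL/core name -- point this at your own Solr instance.
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/files");
        List<SolrInputDocument> docs = new ArrayList<>();

        for (String fileName : listAllFileNames()) {   // however you iterate your 40M names
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", fileName);              // or an MD5 of the name, as discussed elsewhere in the thread
            doc.addField("filename", fileName);
            docs.add(doc);

            if (docs.size() == 1000) {                 // send a batch of 1000
                client.add(docs);
                docs.clear();
            }
        }
        if (!docs.isEmpty()) {
            client.add(docs);                          // send the final partial batch
        }
        client.commit();
        client.close();
    }

    // Stub so the sketch compiles; replace with your real directory walk.
    private static List<String> listAllFileNames() {
        return new ArrayList<>();
    }
}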
Upayavira
On Wed, Aug 5, 2015, at 05:07 PM, Mugeesh Husain wrote:
> The filesystem has about 40 million documents, so it will iterate 40 million
> times. Can SolrJ handle a loop of 40 million iterations?
The filesystem has about 40 million documents, so it will iterate 40 million
times. Can SolrJ handle a loop of 40 million iterations? (Before indexing I
have to split values out of the filename and do some processing, then index
to Solr.)
Is it going to index continuously for all 40 million iterations, or do I have
to sleep in between some iterations?
If you are using Java, you will likely find SolrJ the best way - it uses
serialised Java objects to communicate with Solr - you don't need to
worry about that. Just use code similar to that earlier in the thread.
No XML, no CSV, just simple Java code.
Upayavira
On Wed, Aug 5, 2015, at 04:50 PM, Mugeesh Husain wrote:
@Upayavira
Thanks, these things are very useful for my understanding.
I was thinking that I would create XML or CSV files from my data using Java,
then index them via HTTP POST or bin/post.
I am not using DIH because I didn't find any link or idea for how to split
the data and add it to Solr one by one.
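If you do go the XML route, the file just needs to be in Solr's <add><doc>
update format; a rough, untested sketch of writing one in Java (the part_N
field names are made up for illustration, not from this thread):

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SolrXmlWriter {
    public static void main(String[] args) throws IOException {
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(Paths.get("docs.xml"), StandardCharsets.UTF_8))) {
            out.println("<add>");
            // In the real program this loop would run over all 40M filenames.
            String fileName = "ARIA_SSN10_0007_LOCATION_129.pdf";
            String[] parts = fileName.replace(".pdf", "").split("_");
            out.println("  <doc>");
            out.println("    <field name=\"id\">" + escape(fileName) + "</field>");
            for (int i = 0; i < parts.length; i++) {
                // Hypothetical field names part_0, part_1, ... -- rename to match your schema.
                out.println("    <field name=\"part_" + i + "\">" + escape(parts[i]) + "</field>");
            }
            out.println("  </doc>");
            out.println("</add>");
        }
        // Then index it with something like: bin/post -c yourcore docs.xml
    }

    // Minimal XML escaping for field values.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}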
On Tue, Aug 4, 2015 at 8:10 PM, Mugeesh Husain wrote:
> Thank you Erik, I will prefer XML files instead of CSV.
> For my requirement, if I want to use DIH for indexing, how could I split
> these operations or include Java code in DIH?
>
Here is my favorite way to tweak data in DIH
https://
On Tue, Aug 4, 2015, at 06:13 PM, Mugeesh Husain wrote:
> @Upayavira if I use SolrJ for indexing, will autoCommit or autoSoftCommit
> work in the case of SolrJ?
There are two ways to get content into Solr:
* push it in via an HTTP post.
- this is what SolrJ uses, what bin/post uses, and every
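For illustration, a push over plain HTTP (roughly what bin/post does for you)
might look like the following untested sketch; the core name, document id,
and field name are placeholders:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HttpPush {
    public static void main(String[] args) throws Exception {
        // Placeholder core name -- adjust to your setup.
        URL url = new URL("http://localhost:8983/solr/files/update?commit=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // A JSON array of documents; Solr's /update handler accepts this form.
        String body = "[{\"id\": \"ARIA_SSN10_0007_LOCATION_129.pdf\", \"filename\": \"ARIA_SSN10_0007_LOCATION_129.pdf\"}]";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}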
@Upayavira if I use SolrJ for indexing, will autoCommit or autoSoftCommit
work in the case of SolrJ?
Thank you Erik, I will prefer XML files instead of CSV.
For my requirement, if I want to use DIH for indexing, how could I split
these operations or include Java code in DIH?
I have googled but have not found anything for this type of requirement.
Please provide me a link for it, or some suggestions on how to do it.
If you have data that only consists of id (full filename) and filename
(indexed, tokenized) 40M of those will fit comfortably into a single shard
provided enough RAM to operate.
I know SolrJ is tossed out there a lot as a/the way to index - but if you’ve
got a directory tree of files and want t
Thanks @Alexandre, Erickson and Hatcher.
I will generate an MD5 id from the filename using Java (see the sketch below).
I can do this nicely with SolrJ because I am a Java developer. Apart from
this, the question remains that the data is too large; I think it will have
to be broken into multiple shards (cores).
Using multi core inde
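For the MD5 id generation mentioned above, a minimal, untested sketch (plain
JDK, no extra libraries):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Id {
    // Turn a filename into a stable 32-character hex id.
    public static String md5Of(String fileName) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(fileName.getBytes(StandardCharsets.UTF_8));
        // Pad to 32 chars so leading zeros are not lost.
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(md5Of("ARIA_SSN10_0007_LOCATION_129.pdf"));
    }
}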
Yes, you are right - generally autocommit is a better way. If you are
doing a one-off indexing, then a manual commit may well be the best
option, but generally, autocommit is a better way.
Upayavira
On Mon, Aug 3, 2015, at 11:15 PM, Konstantin Gribov wrote:
> Upayavira, manual commit isn't a good
Upayavira, manual commit isn't good advice, especially with small batches
or single documents, is it? I mostly see recommendations to use
autoCommit + autoSoftCommit instead of manual commits.
On Tue, Aug 4, 2015 at 1:00, Upayavira wrote:
> SolrJ is just a "SolrClient". In pseudocode, you say:
>
> SolrCli
SolrJ is just a "SolrClient". In pseudocode, you say:
SolrClient client =
    new HttpSolrClient("http://localhost:8983/solr/whatever");
List<SolrInputDocument> docs = new ArrayList<>();
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "abc123");
doc.addField("some-text-field", "I like it when the sun s
Well,
If it is just file names, I'd probably use SolrJ client, maybe with
Java 8. Read file names, split the name into parts with regular
expressions, stuff parts into different field names and send to Solr.
Java 8 has FileSystem walkers, etc to make it easier.
You could do it with DIH, but it wo
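A rough, untested sketch of that approach (Java 8 Files.walk plus a plain
underscore split; the part_N field names are placeholders, not from this
thread):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FileNameWalker {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/files");
        try (Stream<Path> paths = Files.walk(Paths.get("D:/work/folder"))) {
            paths.filter(Files::isRegularFile)
                 .forEach(path -> {
                     // e.g. ARIA_SSN10_0007_LOCATION_129.pdf -> [ARIA, SSN10, 0007, LOCATION, 129]
                     String name = path.getFileName().toString();
                     String[] parts = name.replaceAll("\\.pdf$", "").split("_");
                     SolrInputDocument doc = new SolrInputDocument();
                     doc.addField("id", name);
                     for (int i = 0; i < parts.length; i++) {
                         doc.addField("part_" + i, parts[i]); // placeholder field names
                     }
                     try {
                         client.add(doc);   // in real code, batch these as discussed earlier in the thread
                     } catch (Exception e) {
                         throw new RuntimeException(e);
                     }
                 });
        }
        client.commit();
        client.close();
    }
}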
@Alexandre No, I don't need the content of the files. I am repeating my
requirement: I have 40 million files stored in a filesystem, with filenames
saved like ARIA_SSN10_0007_LOCATION_129.pdf.
I just split all the values out of the filename; those values are what I have
to index.
I am interested t
@Erik Hatcher You mean I have to use SolrJ for indexing it, right?
Can SolrJ handle the large amount of data I mentioned in my previous post?
If I use DIH, then how will I split the values from the filename, etc.?
I want to start my development in the right direction; that is why I am a
little confused on
Just to reconfirm, are you indexing file content? Because if you are,
you need to be aware that most PDFs do not extract well, as they do not
have text flow preserved.
If you are indexing PDF files, I would run a sample through Tika
directly (that's what Solr uses under the covers anyway) and see
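If you do end up checking extraction, running one file through Tika
standalone might look roughly like this (untested sketch using Tika's simple
facade; the path is a placeholder):

import java.io.File;

import org.apache.tika.Tika;

public class TikaCheck {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Placeholder path -- point this at one of your real PDFs.
        String text = tika.parseToString(new File("D:/work/folder/ARIA_SSN10_0007_LOCATION_129.pdf"));
        System.out.println(text);   // eyeball whether the extracted text flow is usable
    }
}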
Ahhh, listen to Hatcher if you're not indexing the _contents_ of the
files, just the filenames.
Erick
On Mon, Aug 3, 2015 at 2:22 PM, Erik Hatcher wrote:
> Most definitely yes given your criteria below. If you don’t care for the
> text to be parsed and indexed within the files, a simple fil
Most definitely yes given your criteria below. If you don’t care for the text
to be parsed and indexed within the files, a simple file system crawler that
just got the directory listings and posted the file names split as you’d like
to Solr would suffice it sounds like.
—
Erik Hatcher, Senior S
I'd go with SolrJ personally. For a terabyte of data that (I'm inferring)
are PDF files and the like (aka "semi-structured documents") you'll
need to have Tika parse out the data you need to index. And doing
that through posting or DIH puts all the analysis on the Solr servers,
which will work, but
Hi Alexandre,
I have 40 million files stored in a filesystem, with filenames saved like
ARIA_SSN10_0007_LOCATION_129.pdf.
1.) I have to split all the underscore-separated values out of the filename,
and these values have to be indexed into Solr.
2.) I do not need the file contents (text) to be indexed.
You told me "
That's still a VERY open question. The answer is Yes, but the details
depend on the shape and source of your data. And the search you are
anticipating.
Is this a lot of entries with a small number of fields, or a - relatively -
small number of entries with huge field counts? Do you need to store/ret
Hi,
I am new to Solr development and have the same requirement, and I have
already gained some knowledge, such as how many shards have to be created
for that amount of data, with the help of googling.
I want to ask for some suggestions: there are so many methods for indexing,
such as DIH, Solr, SolrJ.
Please sugges
>Sent: Tuesday, January 17, 2012 12:15 AM
>Subject: Re: Can Apache Solr Handle TeraByte Large Data
>
>I've been toying with the idea of setting up an experiment to index a large
>document set 1+ TB -- any thoughts on an open data set that one could use
>for this purpose?
>
>Than
I've been toying with the idea of setting up an experiment to index a large
document set 1+ TB -- any thoughts on an open data set that one could use
for this purpose?
Thanks.
On Mon, Jan 16, 2012 at 5:00 PM, Burton-West, Tom wrote:
> Hello ,
>
> Searching real-time sounds difficult with that am
Hello,
Searching real-time sounds difficult with that amount of data. With large
documents, 3 million documents, and 5TB of data the index will be very large.
With indexes that large your performance will probably be I/O bound.
Do you plan on allowing phrase or proximity searches? If so, you
Hello,
>
> From: mustafozbek
>
> All documents that we use are rich text documents and we parse them with
> Tika. We need to search in real time.
Because of the real-time requirement, you'll need to use an unreleased/dev
version of Solr.
>Robert Stewart wrote
>> Any idea
Hello,
Inline
- Original Message -
> From: mustafozbek
>
> I have been an Apache Solr user for about a year. I used Solr for simple
> search tools, but now I want to use Solr with 5TB of data. I assume that the
> 5TB of data will be 7TB when Solr indexes it, given the filters that I use. And then I will a
Any idea how many documents your 5TB of data contains? Certain features such
as faceting depend more on the total number of documents than on the actual
size of the data.
I have tested approx. 1 TB (100 million documents) running on a single machine
(40 cores, 128 GB RAM), using distributed search across 10 shard
Maybe also have a look at these links.
http://www.hathitrust.org/blogs/large-scale-search/performance-5-million-volumes
http://www.hathitrust.org/blogs/large-scale-search
On Fri, 13 Jan 2012 15:49:06 +0100, Daniel Brügge
wrote:
> Hi,
>
> it's definitely a problem to store 5TB in Solr without u
Hi,
it's definitely a problem to store 5TB in Solr without using sharding. I try
to split the data over Solr instances so that the index will fit into memory
on each server.
I ran into trouble with a Solr instance using a 50G index.
Daniel
On Jan 13, 2012, at 1:08 PM, mustafozbek wrote:
> I am an apache s