Re: Adding pdf/word file using JSON/XML

Roland Everaert Thu, 13 Jun 2013 00:33:42 -0700

I also apologize for my obscure questions, and I thank you and the list for
your help so far and for the very clear explanation you gave of the
behaviour of Solr and SolrCell.


I am effectively an intermediary between the list and the devs, because our
development process is not efficient. The full story is (beware, it's
boring): we are a bunch of devs in a consultancy company waiting for the
next mission. In the meantime, our boss gives us something to do, but
instead of developing one big application where each dev has a module to
take care of, or each working on our own machine, we have to develop the
same application with various technologies/tools/languages. One is using
.NET, another is using Java and the Spring framework, and the third is using
JavaEE. And I am in the middle as sysadmin/DBA/investigator of tools and
APIs/provider of information and of a transparent API for everybody, while
managing 3 databases, 2 application servers and 2 different indexers on the
same server, taking into consideration that at some point the devs will
swap their tools (RDBMS and/or indexers) *now you can breathe*.

Top that with the fact that one of the devs is experienced in REST and web
technologies (the IDIOT ;)) and that I misread the first line of the
Solr feature page (Solr is a standalone enterprise search server with a
REST-like API), so I actually communicated that Solr provides a RESTful API.

So I think I am a bit overwhelmed by the task at hand.

To conclude, yesterday I discussed this with the team and we decided that I
will provide a RESTful web service that will hide the access to the
indexers, among other things, so even the .NET guy will be able to use it.
That will allow me to study REST and, I hope, ask clearer questions in the
future.

Thanks again for your help and your patience,


Roland Everaert.




On Wed, Jun 12, 2013 at 4:18 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> I'm sorry if I came across as aggressive or insulting - I'm only trying to
> dig down to what your actual difficulty is - and you have been making that
> extremely difficult for all of us. You need to help us all out here by more
> clearly expressing what your actual problem is. You will have to excuse the
> rest of us if we are unable to read your mind!
>
> It sounds as if you are an intermediary between your devs and this list.
> That's NOT a very effective communications strategy! You need to either
> have your devs communicate directly on this list, or you need to do a much
> better job of understanding what their actual problem is and then
> communicate that actual problem to this list, plainly and clearly.
>
> TRYING to read your mind (and indirectly your devs' minds as well - not an
> easy task!), and reading between the lines, it is starting to sound as if
> you (or/and your devs) are not clear on how Solr works as a "database".
>
> Core Solr does have full CRUD (Add or Create, Read or Query, Update, and
> Delete), although not in a strict, pure REST sense, that is true.
>
> A "full" update in Solr is the same as an Add - add a new, fresh document,
> and then delete the old document. Some people call this an "Upsert"
> (combination of Update or Insert).
>
> There are really two forms of update (a difficulty in REST): 1) full
> update or "replace" - equal to a delete and an add, and 2) partial or
> incremental update. True REST only has the latter.
>
> Core Solr does have support for partial or incremental Update with Atomic
> Updates. Solr will in fact retain the existing data and only update any new
> field values that are supplied on the update request.
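
To illustrate the atomic update described above, here is a minimal Python sketch that builds the JSON body for such a request. The helper name `build_atomic_update` and the example values are made up for illustration. The `{"set": ...}` modifier is, as far as I know, the atomic-update syntax of Solr 4.x. The sketch only builds the body and does not send anything.

```python
import json


def build_atomic_update(doc_id, changes):
    """Build a JSON body that sets only the given fields of one document.

    `changes` maps field names to their new values. Fields that are not
    listed are not mentioned in the body at all, which is the point of an
    atomic (partial) update.
    """
    doc = {"id": doc_id}
    for field, value in changes.items():
        # "set" asks Solr to replace the value of this one field.
        doc[field] = {"set": value}
    return json.dumps([doc])


if __name__ == "__main__":
    # Hypothetical values, reusing the id from the thread.
    print(build_atomic_update("doc10", {"name": "NEW NAME"}))
```

The resulting body would go to the core update handler, not to /update/extract, with a JSON content type. Fields not named in the body should be left as they were.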
>
> SolrCell (ExtractingRequestHandler or "/update/extract") is not a core
> part of Solr. It is an add-on "contrib" module. It does not have full CRUD
> - no delete, and no partial update, but it does support add and full update.
>
> As someone else already suggested, you can do the work of SolrCell
> yourself by calling Tika directly in your app layer and then sending normal
> Solr CRUD requests.
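
A rough Python sketch of that app-layer approach follows. The text extraction is passed in as a function because the real call to Tika is not shown here. `build_add_payload`, `extract_text` and the `text` field name are assumptions for illustration, not part of Solr or Tika.

```python
import json


def build_add_payload(doc_id, file_bytes, metadata, extract_text):
    """Build a JSON "add" body from a file's extracted text plus metadata.

    `extract_text` is a stand-in for whatever calls Tika in the app layer:
    it takes the raw file bytes and returns plain text.
    """
    doc = dict(metadata)                   # caller-supplied metadata fields
    doc["id"] = doc_id                     # the id always wins over metadata
    doc["text"] = extract_text(file_bytes)  # assumed content field name
    return json.dumps([doc])


if __name__ == "__main__":
    # A fake extractor so the sketch runs without Tika installed.
    fake_extract = lambda data: data.decode("utf-8")
    print(build_add_payload("doc10", b"hello world", {"name": "BLAH"}, fake_extract))
```

Since the app layer sends the full document each time, it decides which metadata to carry over on an update instead of relying on SolrCell to keep it.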
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: Roland Everaert
> Sent: Wednesday, June 12, 2013 5:21 AM
>
> To: solr-user@lucene.apache.org
> Subject: Re: Adding pdf/word file using JSON/XML
>
> 1) Being aggressive and insulting is not a way to help people understand
> such a complex tool, or to help people in general.
>
> 2) I read the feature page of Solr again and it states that the
> interface is REST-like and not RESTful as I thought in the first place and
> communicated to the devs. And as the devs told me, a RESTful interface
> doesn't use parameters in the URI/URL, so it is my mistake. Hence we have
> no problem with the interface as it is.
>
> Anyway, I still have a question regarding the /extract interface. It seems
> that every time a file is updated in Solr, the Lucene document is recreated
> from scratch, which means that any extra information we want to be
> indexed/stored along with the file is erased if the request doesn't contain
> it. Is there a parameter that allows changing that behaviour?
>
>
>
> Regards,
>
>
> Roland.
>
>
> On Tue, Jun 11, 2013 at 4:35 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> "is it possible to index the file + metadata with a JSON/XML request?"
>>
>> You still aren't being clear as to what you are really trying to achieve
>> here. I mean, just write a shell script that does the curl command, or
>> write a Java program or application layer that uses SolrJ to talk to Solr
>> and accepts JSON/XML/REST requests.
>>
>>
>> "It seems that the only way to index a file with some metadata is to build
>> a request that would look like the following example that uses curl."
>>
>> Curl is just a fancy way to do an HTTP request. You can do the same HTTP
>> request from Java code (or Python or whatever.)
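
As an illustration of the same HTTP request without curl, here is a minimal Python sketch using only the standard library. The helper name `build_extract_request` is made up. The URL, the `literal.*` parameters and the content type are taken from the curl example in this thread. The sketch only builds the request; sending it needs a running Solr.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(base_url, doc_id, file_bytes, content_type, literals=None):
    """Build (but do not send) a POST to Solr's /update/extract handler.

    Metadata fields are passed as literal.<name> URL parameters and the raw
    file bytes go in the request body, mirroring the curl command.
    """
    params = {"literal.id": doc_id}
    for name, value in (literals or {}).items():
        params["literal." + name] = value
    url = base_url.rstrip("/") + "/update/extract?" + urlencode(params)
    return Request(url, data=file_bytes,
                   headers={"Content-Type": content_type}, method="POST")


if __name__ == "__main__":
    req = build_extract_request("http://localhost:8080/solr", "doc10",
                                b"%PDF-1.4 ...", "application/pdf",
                                {"name": "BLAH"})
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

The caller never sees the URL parameters, only function arguments, which is roughly what a thin application layer in front of Solr would offer.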
>>
>>
>> "The developer would like to avoid using parameters in the url to pass
>> arguments."
>>
>> Seriously?! What is THAT all about!!  I mean, really, HTTP and URLs and
>> URL query parameters are part of the heart of the Internet infrastructure!
>>
>> If this whole thread is merely that you have an IDIOT who can't cope with
>> passing HTTP URL query parameters, all I can say is... Wow!
>>
>> But use SolrJ and then at least it doesn't LOOK like they are URL Query
>> parameters.
>>
>> Or, maybe this is just a case where the developer WANTS to use SOAP rather
>> than a REST style of API.
>>
>> In any case, please clue us in as to what PROBLEM you are really trying to
>> solve. Just use plain English and avoid getting caught up in what the
>> solution might be.
>>
>> The real bottom line is that random application developers should not be
>> talking directly to Solr anyway - they should be provided with an
>> "application layer" that has a clean, application-oriented REST API, with
>> the gory details of the Solr API hidden inside the application layer.
>>
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Roland Everaert
>> Sent: Tuesday, June 11, 2013 8:48 AM
>>
>> To: solr-user@lucene.apache.org
>> Subject: Re: Adding pdf/word file using JSON/XML
>>
>> We are working on an application that allows some users to add files (pdf,
>> ms word, odt, etc), located on their local hard disk, to our internal
>> system and allows other users to search for them. So we are considering
>> Solr for the indexing and search functionalities of the system. Along with
>> the file content, we want to index some metadata related to the file.
>>
>> It seems obvious that Solr couldn't import the file from the local disk of
>> the user, so the system will have to import the file into a directory that
>> Solr can reach and instruct Solr to index the file with the metadata, but
>> is it possible to index the file + metadata with a JSON/XML request?
>>
>> It seems that the only way to index a file with some metadata is to build
>> a request that would look like the following example that uses curl. The
>> developer would like to avoid using parameters in the URL to pass
>> arguments.
>>
>> curl "http://localhost:8080/solr/update/extract?literal.id=doc10&literal.name=BLAH&defaultField=text" \
>> --data-binary @/path/to/file.pdf -H "Content-Type: application/pdf"
>>
>>
>> Additionally, it seems that if a subsequent request is sent to the indexer
>> to update the file, if the metadata are not passed to Solr with the
>> request, they are deleted.
>>
>> Thanks for your help,
>>
>>
>>
>> Roland.
>>
>>
>> On Mon, Jun 10, 2013 at 4:14 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>>
>>
>>> Sorry, but you are STILL not being clear!
>>>
>>> Are you asking if you can pass Solr parameters as XML fields? No.
>>>
>>> Are you asking if the file name and path can be indexed as metadata? To
>>> some degree:
>>>
>>> curl "http://localhost:8983/solr/update/extract?literal.id=doc-1\
>>> &commit=true&uprefix=attr_" -F "HelloWorld.docx=@HelloWorld.docx"
>>>
>>>
>>> Then the stream has a name that is indexed as metadata:
>>>
>>> <arr name="attr_meta">
>>>  <str>stream_source_info</str>
>>>  <str>HelloWorld.docx</str>
>>>  <str>stream_content_type</str>
>>>  <str>application/octet-stream</str>
>>>  <str>stream_size</str>
>>>  <str>10096</str>
>>>  <str>stream_name</str>
>>>  <str>HelloWorld.docx</str>
>>>  <str>Content-Type</str>
>>>  <str>application/vnd.openxmlformats-officedocument.wordprocessingml.document</str>
>>> </arr>
>>>
>>> and
>>>
>>> <arr name="attr_stream_source_info">
>>>  <str>HelloWorld.docx</str>
>>> </arr>
>>>
>>> <arr name="attr_stream_name">
>>>  <str>HelloWorld.docx</str>
>>> </arr>
>>>
>>> Or, what is it that you are really trying to do?
>>>
>>> Simply tell us in plain language what problem you are trying to solve.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Roland Everaert
>>> Sent: Monday, June 10, 2013 9:23 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Adding pdf/word file using JSON/XML
>>>
>>>
>>> Sorry if it was not clear.
>>>
>>> What I would like to know is how to construct an XML/JSON request that
>>> provides any necessary information (supposedly the full path on disk) to
>>> Solr to retrieve and index a pdf/ms word document.
>>>
>>> So, an XML request could look like this:
>>>
>>> <add>
>>> <doc>
>>> <field name="id">doc10</field>
>>> <field name="name">BLAH</field>
>>> <field name="path">/path/to/file.pdf</field>
>>> </doc>
>>> </add>
>>>
>>>
>>> Regards,
>>>
>>>
>>> Roland.
>>>
>>>
>>> On Mon, Jun 10, 2013 at 3:12 PM, Gora Mohanty <g...@mimirtech.com>
>>> wrote:
>>>
>>>> On 10 June 2013 17:47, Roland Everaert <reveatw...@gmail.com> wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > Based on the wiki, below is an example of how I am currently adding a
>>>> > pdf file with an extra field called name:
>>>> > curl "http://localhost:8080/solr/update/extract?literal.id=doc10&literal.name=BLAH&defaultField=text" \
>>>> > --data-binary @/path/to/file.pdf -H "Content-Type: application/pdf"
>>>> >
>>>> > Is it possible to add a file + any extra fields using a JSON or XML
>>>> > request?
>>>>
>>>> It is not entirely clear what you are asking. Do you mean
>>>> can one do the same as your example above for a PDF
>>>> file, but with an XML or JSON file? If so, yes. Please see
>>>> the examples in example/exampledocs/ of a Solr source
>>>> tree, and http://wiki.apache.org/solr/ExtractingRequestHandler
>>>>
>>>> Regards,
>>>> Gora
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
