Re: How do you parse the data in a field that is returned from a query?

Carl Roberts Sat, 24 Jan 2015 12:51:07 -0800

Via this rss-data-config.xml file and a class that I wrote (attached) to download an XML file from a ZIP URL:


<dataConfig>

    <dataSource type="ZIPURLDataSource" connectionTimeout="15000" readTimeout="30000"/>

    <document>
        <entity name="cve-2002"
                pk="id"
                url="https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.xml.zip"
                processor="XPathEntityProcessor"
                forEach="/nvd/entry">

            <field column="id" xpath="/nvd/entry/@id" commonField="false" />
            <field column="cve" xpath="/nvd/entry/cve-id" commonField="false" />
            <field column="cwe" xpath="/nvd/entry/cwe/@id" commonField="false" />
            <field column="vulnerable-configuration" xpath="/nvd/entry/vulnerable-configuration/logical-test/fact-ref/@name" commonField="false" />
            <field column="vulnerable-software" xpath="/nvd/entry/vulnerable-software-list/product" commonField="false" />
            <field column="published" xpath="/nvd/entry/published-datetime" commonField="false" />
            <field column="modified" xpath="/nvd/entry/last-modified-datetime" commonField="false" />
            <field column="summary" xpath="/nvd/entry/summary" commonField="false" />

        </entity>
        <entity name="cve-2003"
                pk="id"
                url="http://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2003.xml.zip"
                processor="XPathEntityProcessor"
                forEach="/nvd/entry">

        </entity>
        <!--
        <entity name="nvd-rss-update"
                pk="link"
                url="https://nvd.nist.gov/download/nvd-rss.xml"
                processor="XPathEntityProcessor"
                forEach="/RDF/item"
                transformer="DateFormatTransformer"
                preImportDeleteQuery="">

            <field column="id" xpath="/RDF/item/title" commonField="true" />
            <field column="link" xpath="/RDF/item/link" commonField="true" />
            <field column="summary" xpath="/RDF/item/description" commonField="true" />
            <field column="date" xpath="/RDF/item/date" commonField="true" />

        </entity>
        -->
    </document>
</dataConfig>


On 1/24/15, 3:45 PM, Jack Krupansky wrote:

How are you currently importing data?

-- Jack Krupansky

On Sat, Jan 24, 2015 at 3:42 PM, Carl Roberts <carl.roberts.zap...@gmail.com>
wrote:

Sorry if I was not clear.  What I am asking is this:

How can I parse the data during import to tokenize it by (:) and strip the
cpe:/o?



On 1/24/15, 3:28 PM, Alexandre Rafalovitch wrote:

You are using keywords here that seem to contradict each other.
Or your use case is not clear.

Specifically, you are saying you are getting stuff from a (Solr?)
query. So, the results are now outside of Solr. Then you are asking
for help to strip stuff off it. Well, it's outside of Solr, do
whatever you want with it!

But then at the end, you say you want to search for whatever you
stripped off. So, that should be back in Solr again?

Or are you asking something along these lines:
1. I have a multiValued field with the following sample content... (it
does not matter to Solr where it comes from)
2. I want it returned as is, but I want to be able to find documents
when somebody searches for X, Y, or Z
3. What would be the best analyzer chain to be able to do so?

Regards,
     Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 24 January 2015 at 15:04, Carl Roberts <carl.roberts.zap...@gmail.com>
wrote:

Hi,

How can I parse the data in a field that is returned from a query?

Basically,

I have a multi-valued field that contains values such as these that are
returned from a query:

            "cpe:/o:freebsd:freebsd:1.1.5.1",
            "cpe:/o:freebsd:freebsd:2.2.3",
            "cpe:/o:freebsd:freebsd:2.2.2",
            "cpe:/o:freebsd:freebsd:2.2.5",
            "cpe:/o:freebsd:freebsd:2.2.4",
            "cpe:/o:freebsd:freebsd:2.0.5",
            "cpe:/o:freebsd:freebsd:2.2.6",
            "cpe:/o:freebsd:freebsd:2.1.6.1",
            "cpe:/o:freebsd:freebsd:2.0.1",
            "cpe:/o:freebsd:freebsd:2.2",
            "cpe:/o:freebsd:freebsd:2.0",
            "cpe:/o:openbsd:openbsd:2.3",
            "cpe:/o:freebsd:freebsd:3.0",
            "cpe:/o:freebsd:freebsd:1.1",
            "cpe:/o:freebsd:freebsd:2.1.6",
            "cpe:/o:openbsd:openbsd:2.4",
            "cpe:/o:bsdi:bsd_os:3.1",
            "cpe:/o:freebsd:freebsd:1.0",
            "cpe:/o:freebsd:freebsd:2.1.7",
            "cpe:/o:freebsd:freebsd:1.2",
            "cpe:/o:freebsd:freebsd:2.1.5",
            "cpe:/o:freebsd:freebsd:2.1.7.1"],

And my problem is that I need to strip the cpe:/o part and I also need to
tokenize words using the (:) as a separator so that I can then search for
"freebsd 1.1" or "openbsd 2.4" or just "freebsd".
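
To make the requested transformation concrete, here is a minimal Java sketch of it, run outside of Solr. The class CpeTokens and its tokens() method are hypothetical names made up for illustration, not part of Solr or of the attached class. It assumes every value has the form cpe:/<part>:<vendor>:<product>:<version>, drops the "cpe:/" prefix together with the one-letter part code, and splits the rest on ":".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not a Solr API.
class CpeTokens {

    private static final String PREFIX = "cpe:/";

    // Strips "cpe:/<part>" and returns the remaining ':'-separated tokens.
    static List<String> tokens(String cpe) {
        if (!cpe.startsWith(PREFIX)) {
            // Assumption: values without the prefix are simply split on ':'.
            return new ArrayList<>(Arrays.asList(cpe.split(":")));
        }
        String[] parts = cpe.substring(PREFIX.length()).split(":");
        List<String> out = new ArrayList<>();
        // parts[0] is the part code (for example "o"); skip it.
        for (int i = 1; i < parts.length; i++) {
            if (!parts[i].isEmpty()) {
                out.add(parts[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("cpe:/o:freebsd:freebsd:1.1")); // [freebsd, freebsd, 1.1]
        System.out.println(tokens("cpe:/o:openbsd:openbsd:2.4")); // [openbsd, openbsd, 2.4]
    }
}
```

With tokens like these indexed, a search for "freebsd 1.1" or just "freebsd" has something to match. Inside Solr the same stripping and splitting would more likely be done at index time, either in the import step or in the field's analysis chain, rather than in standalone code like this.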

Thanks in advance.

Joe

package org.apache.solr.handler.dataimport;

import java.util.zip.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.*;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.EOFException;

public class ZIPURLDataSource extends DataSource<Reader> {
        private static final Logger LOG = LoggerFactory.getLogger(ZIPURLDataSource.class);

        private String baseUrl;

        private String encoding;

        private int connectionTimeout = CONNECTION_TIMEOUT;

        private int readTimeout = READ_TIMEOUT;

        private Context context;

        private Properties initProps;


        public ZIPURLDataSource(){
                super();
        }
        
        @Override
        public void init(Context context, Properties initProps) {
                this.context = context;
                this.initProps = initProps;

                baseUrl = getInitPropWithReplacements(BASE_URL);
                if (getInitPropWithReplacements(ENCODING) != null)
                  encoding = getInitPropWithReplacements(ENCODING);
                String cTimeout = getInitPropWithReplacements(CONNECTION_TIMEOUT_FIELD_NAME);
                String rTimeout = getInitPropWithReplacements(READ_TIMEOUT_FIELD_NAME);
                if (cTimeout != null) {
                  try {
                        connectionTimeout = Integer.parseInt(cTimeout);
                  } catch (NumberFormatException e) {
                        LOG.warn("Invalid connection timeout: " + cTimeout);
                  }
                }
                if (rTimeout != null) {
                  try {
                        readTimeout = Integer.parseInt(rTimeout);
                  } catch (NumberFormatException e) {
                        LOG.warn("Invalid read timeout: " + rTimeout);
                  }
                }
        }

        
        @Override
   public Reader getData(String query) {
    URL url = null;
    try {
      if (URIMETHOD.matcher(query).find()) 
        url = new URL(query);
      else 
        url = new URL(baseUrl + query);

      LOG.debug("Accessing URL: " + url.toString());

      URLConnection conn = url.openConnection();
      conn.setConnectTimeout(connectionTimeout);
      conn.setReadTimeout(readTimeout);
      InputStream in = conn.getInputStream();
      if (in == null){
                LOG.info("Invalid InputStream {" + in + "}");
        return null;
      }
      InputStream bis = unzip(in);
      in.close();
      in = bis;
      if (in != null){
        LOG.info("bytes available are {" + in.available() + "}");
      }else{
        LOG.info("Invalid InputStream {" + in + "}");
        return null;
      }
      String enc = encoding;
      if (enc == null) {
        String cType = conn.getContentType();
        if (cType != null) {
          Matcher m = CHARSET_PATTERN.matcher(cType);
          if (m.find()) {
            enc = m.group(1);
          }
        }
      }
      if (enc == null)
        enc = UTF_8;
      LOG.info("Encoding={" + enc + "}");
      DataImporter.QUERY_COUNT.get().incrementAndGet();
      InputStreamReader ir = new InputStreamReader(in, enc);
      LOG.info("InputStreamReader.ready()={" + ir.ready() + "}");
      LOG.info("InputStreamReader.getEncoding()={" + ir.getEncoding() + "}");
      return ir;
    } catch (Exception e) {
      LOG.error("Exception thrown while getting data", e);
      throw new DataImportHandlerException(DataImportHandlerException.SEVERE,
              "Exception in invoking url " + url, e);
    }
  }

  @Override
  public void close() {
  }
  
  private InputStream unzip(InputStream in) throws Exception{

        ZipInputStream zin = null;
        ZipEntry entry = null;
        try{
            zin = new ZipInputStream(in);
            
            // Only the first ZIP entry is read: the method returns from inside the loop.
            while((entry = zin.getNextEntry()) != null){
            
                byte raw[] = new byte[1024];
                int read = 0;
                        ByteArrayOutputStream bos = new ByteArrayOutputStream();
                        while ((read = zin.read(raw)) != -1) {
                        bos.write(raw, 0, read);
                        }
                        bos.close();
                zin.closeEntry();
                raw = bos.toByteArray();
                LOG.info("raw bytes={" + raw.length + "}");
                putBinaryToFile(raw, "raw.xml");
                return new ByteArrayInputStream(raw);
               
            }
        }finally{
            if (zin != null){
                try{
                    zin.close();
                }catch(Exception e){}
            }
        }
        return null;
    }
    
    public String getBaseUrl() {
    return baseUrl;
  }

  private String getInitPropWithReplacements(String propertyName) {
    final String expr = initProps.getProperty(propertyName);
    if (expr == null) {
      return null;
    }
    return context.replaceTokens(expr);
  }

    private void putBinaryToFile(byte[] buf, String fileName) throws IOException {
        putBinaryToFile(buf, 0, buf.length, fileName);
    }

    
    private void putBinaryToFile(byte[] buf, int off, int len, String fileName) throws IOException {
    
        FileOutputStream fos = null;
        BufferedOutputStream out = null;
        try{
                fos = new FileOutputStream(fileName);
                out = new BufferedOutputStream(fos);
                out.write(buf, off, len);
        }finally{
                if (out != null){
                        try{
                                out.close();
                        }catch(Exception e){
                                LOG.error(e.getMessage());
                        }
                }
                if (fos != null){
                        try{
                                fos.close();
                        }catch(Exception e){
                                LOG.error(e.getMessage());
                        }
                }
        }
    }

  static final Pattern URIMETHOD = Pattern.compile("\\w{3,}:/");

  private static final Pattern CHARSET_PATTERN = Pattern.compile(".*?charset=(.*)$", Pattern.CASE_INSENSITIVE);

  public static final String ENCODING = "encoding";

  public static final String BASE_URL = "baseUrl";

  public static final String UTF_8 = StandardCharsets.UTF_8.name();

  public static final String CONNECTION_TIMEOUT_FIELD_NAME = "connectionTimeout";

  public static final String READ_TIMEOUT_FIELD_NAME = "readTimeout";

  public static final int CONNECTION_TIMEOUT = 5000;

  public static final int READ_TIMEOUT = 10000;
}
