[I] Query on nested struct field with PyIceberg? [iceberg-python]

via GitHub Mon, 22 Jul 2024 05:03:31 -0700


cfrancois7 opened a new issue, #953:
URL: https://github.com/apache/iceberg-python/issues/953


   ### Question
   
   I'm looking for a tutorial on querying a single subfield of a struct field.
   I searched the internet but could not find a simple way to do it with pyiceberg.
   
   To make it concrete: how do I get the rows where `employment.status = 'Employed'` from data like this?
   
   ```python
   [{'id': 1,
     'name': 'Alice',
     'age': 28,
     'address': {'street': '123 Maple St',
      'city': 'Springfield',
      'postal_code': '12345'},
     'contact': {'email': 'al...@example.com', 'phone': '555-1234'},
     'employment': {'status': 'Employed',
      'position': 'Software Engineer',
      'company': {'name': 'Tech Corp', 'location': 'Silicon Valley'}},
     'preferences': {'newsletter': True,
      'notifications': {'email': True, 'sms': False}}},
    {'id': 2,
     'name': 'Bob',
     'age': 35,
     'address': {'street': '456 Oak St',
      'city': 'Metropolis',
      'postal_code': '67890'},
     'contact': {'email': 'b...@example.com', 'phone': '555-5678'},
     'employment': {'status': 'Self-employed',
      'position': 'Consultant',
      'company': {'name': 'Freelance', 'location': 'Remote'}},
     'preferences': {'newsletter': False,
      'notifications': {'email': True, 'sms': True}}}]
   ```
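To spell out the result I expect, here is a minimal plain-Python sketch of the same selection on a trimmed copy of the rows above. `get_path` is a hypothetical helper written only for this illustration; it is not part of pyiceberg.

```python
from functools import reduce


def get_path(row, dotted_path):
    """Resolve a dotted path such as 'employment.status' in nested dicts."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), row)


# Trimmed copy of the sample rows: only the fields needed for the filter.
rows = [
    {"id": 1, "name": "Alice", "employment": {"status": "Employed"}},
    {"id": 2, "name": "Bob", "employment": {"status": "Self-employed"}},
]

# Keep only the rows whose nested employment.status equals 'Employed'.
employed = [row for row in rows if get_path(row, "employment.status") == "Employed"]
print([row["name"] for row in employed])  # prints ['Alice']
```

In other words, the scan should return only Alice's row.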
      
   With the following schema:
   ```python
   import pyarrow as pa
   
   schema = pa.schema([
     ('id', pa.int32()),
     ('name', pa.string()),
     ('age', pa.int32()),
     ('address', pa.struct([
         ('street', pa.string()),
         ('city', pa.string()),
         ('postal_code', pa.string())
     ])),
     ('contact', pa.struct([
         ('email', pa.string()),
         ('phone', pa.string())
     ])),
     ('employment', pa.struct([
         pa.field('status', pa.string(), nullable=True),
         pa.field('position', pa.string(), nullable=True),
         pa.field('company', pa.struct([
             ('name', pa.string()),
             ('location', pa.string())
         ]), nullable=True)
     ])),
     ('preferences', pa.struct([
         ('newsletter', pa.bool_()),
         ('notifications', pa.struct([
             ('email', pa.bool_()),
             ('sms', pa.bool_())
         ]))
     ]))
   ])
   ```
   
   I tried this kind of query, but without success:
   
   ```python
   row_filter = "employment.status = 'Employed'"
   
   table.scan(
       row_filter=row_filter,
       selected_fields=["age", "employment", 'contact.email']
   ).to_pandas()
   ```
   The command raises the error:
   ```Python
   ValueError: Could not find field with name status, case_sensitive=True
   ```
   
   The catalog is backed by SQLite.
   
   Versions:
   ```bash
   $ pip list | grep 'iceberg\|arrow\|sqlite'
   arrow                     1.3.0
   pyarrow                   15.0.2
   pyiceberg                 0.6.1
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
