Repository: spark
Updated Branches:
  refs/heads/master fa7c582e9 -> 1b9ba258e


[MINOR][DOCS] Fix a few typos in structured streaming doc

## What changes were proposed in this pull request?

Fix a minor typo, `even-time`, which is changed to `event-time`, along with a 
couple of grammatical errors.

## How was this patch tested?

N/A, since this is a doc fix. I did a Jekyll build locally, though.

Author: Ramkumar Venkataraman <[email protected]>

Closes #17037 from ramkumarvenkat/doc-fix.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1b9ba258
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1b9ba258
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1b9ba258

Branch: refs/heads/master
Commit: 1b9ba258e086e2ba89a4f35a54106e2f8a38b525
Parents: fa7c582
Author: Ramkumar Venkataraman <[email protected]>
Authored: Sat Feb 25 02:18:22 2017 +0000
Committer: Sean Owen <[email protected]>
Committed: Sat Feb 25 02:18:22 2017 +0000

----------------------------------------------------------------------
 docs/structured-streaming-programming-guide.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/1b9ba258/docs/structured-streaming-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index ad3b2fb..6af47b6 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -392,7 +392,7 @@ data, thus relieving the users from reasoning about it. As 
an example, let’s
 see how this model handles event-time based processing and late arriving data.
 
 ## Handling Event-time and Late Data
-Event-time is the time embedded in the data itself. For many applications, you 
may want to operate on this event-time. For example, if you want to get the 
number of events generated by IoT devices every minute, then you probably want 
to use the time when the data was generated (that is, event-time in the data), 
rather than the time Spark receives them. This event-time is very naturally 
expressed in this model -- each event from the devices is a row in the table, 
and event-time is a column value in the row. This allows window-based 
aggregations (e.g. number of events every minute) to be just a special type of 
grouping and aggregation on the even-time column -- each time window is a group 
and each row can belong to multiple windows/groups. Therefore, such 
event-time-window-based aggregation queries can be defined consistently on both 
a static dataset (e.g. from collected device events logs) as well as on a data 
stream, making the life of the user much easier.
+Event-time is the time embedded in the data itself. For many applications, you 
may want to operate on this event-time. For example, if you want to get the 
number of events generated by IoT devices every minute, then you probably want 
to use the time when the data was generated (that is, event-time in the data), 
rather than the time Spark receives them. This event-time is very naturally 
expressed in this model -- each event from the devices is a row in the table, 
and event-time is a column value in the row. This allows window-based 
aggregations (e.g. number of events every minute) to be just a special type of 
grouping and aggregation on the event-time column -- each time window is a 
group and each row can belong to multiple windows/groups. Therefore, such 
event-time-window-based aggregation queries can be defined consistently on both 
a static dataset (e.g. from collected device events logs) as well as on a data 
stream, making the life of the user much easier.
 
 Furthermore, this model naturally handles data that has arrived later than 
 expected based on its event-time. Since Spark is updating the Result Table, 
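
For readers of the corrected passage above, here is a minimal sketch of the window-based aggregation it describes. It assumes a spark-shell session (so `spark` and `spark.implicits._` are available) and uses the socket source purely for illustration; its attached timestamp stands in for a true event-time here.

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._

// Illustrative source: with "includeTimestamp" the socket source attaches an
// arrival timestamp, yielding a streaming DataFrame with schema
// { value: String, timestamp: Timestamp }.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()

// Window-based aggregation is just a special kind of grouping on the
// event-time column: each 1-minute window is a group, and a row's window is
// derived from its timestamp.
val perMinuteCounts = lines
  .groupBy(window($"timestamp", "1 minute"))
  .count()
```
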
@@ -401,7 +401,7 @@ as well as cleaning up old aggregates to limit the size of 
intermediate
 state data. Since Spark 2.1, we have support for watermarking which 
 allows the user to specify the threshold of late data, and allows the engine
 to accordingly clean up old state. These are explained later in more 
-details in the [Window Operations](#window-operations-on-event-time) section.
+detail in the [Window Operations](#window-operations-on-event-time) section.
 
 ## Fault Tolerance Semantics
 Delivering end-to-end exactly-once semantics was one of key goals behind the 
design of Structured Streaming. To achieve that, we have designed the 
Structured Streaming sources, the sinks and the execution engine to reliably 
track the exact progress of the processing so that it can handle any kind of 
failure by restarting and/or reprocessing. Every streaming source is assumed to 
have offsets (similar to Kafka offsets, or Kinesis sequence numbers)
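
The fault-tolerance semantics above rest on sources having offsets and on the engine tracking the query's progress. Continuing the earlier sketch, a query records that progress under a checkpoint directory; the path below is a placeholder, and the console sink is used only to keep the example small.

```scala
// The engine records the offset range processed in each trigger under the
// checkpoint location, so the query can be restarted after a failure.
val query = perMinuteCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/per-minute-counts")  // placeholder path
  .start()

query.awaitTermination()
```
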
@@ -647,7 +647,7 @@ df.groupBy("deviceType").count()
 </div>
 
 ### Window Operations on Event Time
-Aggregations over a sliding event-time window are straightforward with 
Structured Streaming. The key idea to understand about window-based 
aggregations are very similar to grouped aggregations. In a grouped 
aggregation, aggregate values (e.g. counts) are maintained for each unique 
value in the user-specified grouping column. In case of window-based 
aggregations, aggregate values are maintained for each window the event-time of 
a row falls into. Let's understand this with an illustration. 
+Aggregations over a sliding event-time window are straightforward with 
Structured Streaming and are very similar to grouped aggregations. In a grouped 
aggregation, aggregate values (e.g. counts) are maintained for each unique 
value in the user-specified grouping column. In case of window-based 
aggregations, aggregate values are maintained for each window the event-time of 
a row falls into. Let's understand this with an illustration. 
 
 Imagine our [quick example](#quick-example) is modified and the stream now 
contains lines along with the time when the line was generated. Instead of 
running word counts, we want to count words within 10 minute windows, updating 
every 5 minutes. That is, word counts in words received between 10 minute 
windows 12:00 - 12:10, 12:05 - 12:15, 12:10 - 12:20, etc. Note that 12:00 - 
12:10 means data that arrived after 12:00 but before 12:10. Now, consider a 
word that was received at 12:07. This word should increment the counts 
corresponding to two windows 12:00 - 12:10 and 12:05 - 12:15. So the counts 
will be indexed by both, the grouping key (i.e. the word) and the window (can 
be calculated from the event-time).
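
A sketch of that windowed count, mirroring the guide's windowed word count example and reusing the `lines` DataFrame from the earlier sketch (still a spark-shell session):

```scala
import java.sql.Timestamp
import org.apache.spark.sql.functions.window
import spark.implicits._

// Split each line into words, keeping the timestamp of the line it came from.
val words = lines
  .as[(String, Timestamp)]
  .flatMap { case (line, ts) => line.split(" ").map(word => (word, ts)) }
  .toDF("word", "timestamp")

// Counts are indexed by both the grouping key (the word) and the window derived
// from the event-time: 10-minute windows sliding every 5 minutes, so a word
// received at 12:07 is counted in both 12:00 - 12:10 and 12:05 - 12:15.
val windowedCounts = words
  .groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"word")
  .count()
```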
 
@@ -713,7 +713,7 @@ old windows correctly, as illustrated below.
 
 ![Handling Late Data](img/structured-streaming-late-data.png)
 
-However, to run this query for days, its necessary for the system to bound the 
amount of 
+However, to run this query for days, it's necessary for the system to bound 
the amount of 
 intermediate in-memory state it accumulates. This means the system needs to 
know when an old 
 aggregate can be dropped from the in-memory state because the application is 
not going to receive 
 late data for that aggregate any more. To enable this, in Spark 2.1, we have 
introduced 
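
As noted earlier in the diff, Spark 2.1 introduces watermarking for exactly this purpose. A brief sketch of how the windowed count above bounds its state with a 10-minute watermark on the event-time column, using the `withWatermark` API available since Spark 2.1:

```scala
// The watermark tells the engine how late data may be relative to the newest
// event-time seen so far; state for windows older than
// (max event-time seen - 10 minutes) can be dropped.
val watermarkedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"word")
  .count()
```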

