Docs for data oriented schema (#290)
* Started doc on schemas

* First draft of schema doc

* Proofread data-oriented schema doc

* Made type-shorthand more consistent
anthony-khong committed Nov 10, 2020
1 parent 80f7835 commit bd71983
Showing 5 changed files with 78 additions and 31 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -31,6 +31,7 @@ Geni provides an idiomatic Spark interface for Clojure without the hassle of Jav
<li><a href="docs/simple_performance_benchmark.md">A Simple Performance Benchmark</a></li>
<li><a href="CODE_OF_CONDUCT.md">Code of Conduct</a></li>
<li><a href="CONTRIBUTING.md">Contributing Guide</a></li>
<li><a href="docs/creating_spark_schemas.md">Creating Spark Schemas</a></li>
<li><a href="docs/examples.md">Examples</a></li>
<li><a href="docs/design_goals.md">Design Goals</a></li>
<li><a href="docs/semantics.md">Geni Semantics</a></li>
@@ -42,7 +43,7 @@ Geni provides an idiomatic Spark interface for Clojure without the hassle of Jav
<li><a href="docs/spark_session.md">Where's The Spark Session</a></li>
<li><a href="docs/why.md">Why?</a></li>
<li><a href="docs/sql_maps.md">Working with SQL Maps</a></li>
<li><a href="docs/collect.md">Collect data into Clojure Repl</a></li>
<li><a href="docs/collect.md">Collecting Data from Spark Datasets</a></li>
</ul>
</td>
<td>
46 changes: 17 additions & 29 deletions docs/collect.md
@@ -1,19 +1,16 @@
-# Collect data to the driver
+# Collecting Data from Spark Datasets


So far we have used Spark functions, which are executed on all nodes of the cluster by the Spark engine. Spark offers many different ways to operate on data, but sometimes we want to manipulate the data differently, in pure Clojure.

-This requires to move the data from the Spark workers into the driver node, on which the Clojure Repl is running. This is only useful and possible, if the data is **small**, and fits on the node.
-
-Geni offers several functions starting with `collect-` which transport the data to the driver and then into the Clojure Repl.

+This requires moving the data from the Spark workers to the driver node, on which the Clojure REPL is running. This is only useful and possible if the data is **small** and fits on the node.
+
+Geni offers several functions starting with `collect-` which transport the data to the driver and then into the Clojure REPL.

## Collect as Clojure data

-A very common case is to access the data as Clojure maps with `collect`
+A very common case is to access the data as Clojure maps with `collect`:

```clojure

@@ -41,28 +38,23 @@ A very common case is to access the data as Clojure maps with `collect`
)
```

-Alternatively we can get the data as a sequence of vectors with `collect-vals`
+Alternatively we can get the data as a sequence of vectors with `collect-vals`:

```clojure

(-> fixed-df (g/limit 2) g/collect-vals)
-=>
-(["01/01/2012" 35 nil 0 38 51 26 10 16 nil]
- ["02/01/2012" 83 nil 1 68 153 53 6 43 nil])
+=> (["01/01/2012" 35 nil 0 38 51 26 10 16 nil]
+    ["02/01/2012" 83 nil 1 68 153 53 6 43 nil])
```

To access the values of a single column, we can use `collect-col`:

```clojure
-(-> fixed-df (g/limit 2) (g/collect-col :Date))
-=>
-("01/01/2012" "02/01/2012")
+(-> fixed-df (g/limit 2) (g/collect-col :Date))
+=> ("01/01/2012" "02/01/2012")
```

-## Collect as arrow files
+## Collect as Arrow files

We can get the data into the driver as Arrow files as well, by using the function `collect-to-arrow`.
This has the advantage that it can work with data larger than the heap space of the driver.
@@ -81,25 +73,21 @@ The function returns a sequence of file names created.

```clojure
(-> fixed-df
-    (g/repartition 20) ;; split data in 20 partitions of equal size
-    (g/collect-to-arrow 100 "/tmp"))
-=>
-["/tmp/geni12331590604347994819.ipc" "/tmp/geni2107925719499812901.ipc" "/tmp/geni2239579196531625282.ipc" "/tmp/geni14530350610103010872.ipc"]
+    (g/repartition 20) ;; split data in 20 partitions of equal size
+    (g/collect-to-arrow 100 "/tmp"))
+=> ["/tmp/geni12331590604347994819.ipc"
+    "/tmp/geni2107925719499812901.ipc"
+    "/tmp/geni2239579196531625282.ipc"
+    "/tmp/geni14530350610103010872.ipc"]
```

Setting the number of partitions and the chunk size small enough should allow the transfer of arbitrarily large data to the driver, but this can obviously become slow if the data is big.

The files are written in the Arrow stream format, which can be processed by other software packages or with the Clojure `tech.ml.dataset` library.
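
As a minimal sketch of that workflow (assuming a recent tech.ml.dataset on the classpath that provides the `tech.v3.libs.arrow` namespace), one of the generated files can be read back as follows:

```clojure
(require '[tech.v3.libs.arrow :as arrow])

;; Read one of the Arrow stream files produced by collect-to-arrow
;; back into a tech.ml.dataset dataset on the driver.
(arrow/stream->dataset "/tmp/geni12331590604347994819.ipc")
```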



## Integration with tech.ml.dataset

The very latest alpha version of [tech.ml.dataset](https://github.com/techascent/tech.ml.dataset) offers a deeper integration with Geni, and allows converting a Spark data frame directly into a tech.ml.dataset dataset. This happens on the driver, so the data needs to fit in heap space.

-See (here)[https://github.com/techascent/tech.ml.dataset/blob/43f411d224a50057ae7d8817d89eda3611b33115/src/tech/v3/libs/spark.clj#L191]
-
-for details.
+See [here](https://github.com/techascent/tech.ml.dataset/blob/43f411d224a50057ae7d8817d89eda3611b33115/src/tech/v3/libs/spark.clj#L191) for details.
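
A hypothetical sketch of that direct conversion (the function name `spark-dataset->dataset` is illustrative only; the actual API is defined in the linked `spark.clj`):

```clojure
(require '[tech.v3.libs.spark :as tech-spark])

;; Hypothetical function name; see the linked spark.clj for the real API.
(def tmd-ds (tech-spark/spark-dataset->dataset fixed-df))
```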
57 changes: 57 additions & 0 deletions docs/creating_spark_schemas.md
@@ -0,0 +1,57 @@
# Creating Spark Schemas

Schema creation is typically required for [manual Dataset creation](manual_dataset_creation.md) and for having more control when loading a Dataset from a file.

One way to create a Spark schema is to use the Geni API that closely mimics the original Scala Spark API using **Spark DataTypes**. That is, the following Scala version:

```scala
StructType(Array(
  StructField("a", IntegerType, true),
  StructField("b", StringType, true),
  StructField("c", ArrayType(ShortType, true), true),
  StructField("d", MapType(StringType, IntegerType, true), true),
  StructField(
    "e",
    StructType(Array(
      StructField("x", FloatType, true),
      StructField("y", DoubleType, true)
    )),
    true
  )
))
```

gets translated into:

```clojure
(g/struct-type
 (g/struct-field :a :int true)
 (g/struct-field :b :str true)
 (g/struct-field :c (g/array-type :short true) true)
 (g/struct-field :d (g/map-type :str :int) true)
 (g/struct-field :e
                 (g/struct-type
                  (g/struct-field :x :float true)
                  (g/struct-field :y :double true))
                 true))
```

Whilst the Clojure version may look cleaner than the original Scala version, Geni offers an even more concise way to specify complex schemas such as the example above and cut through the boilerplate. In particular, we can use Geni's **data-oriented schemas**:

```clojure
{:a :int
 :b :str
 :c [:short]
 :d [:str :int]
 :e {:x :float :y :double}}
```

The conversion rules are simple:

* all fields and types default to nullable;
* a vector of count one is interpreted as an `ArrayType`;
* a vector of count two is interpreted as a `MapType`;
* a map is interpreted as a nested `StructType`; and
* everything else is left as is.

In particular, the last rule allows us to mix and match the data-oriented style with the Spark DataType style for specifying nested types.
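
As a minimal sketch of that last rule (reusing the `g/` alias from the examples above), the following schema mixes the data-oriented shorthands with an explicit Spark DataType call:

```clojure
{:a :int                          ; plain keyword: left as is
 :b [:str]                        ; one-element vector: ArrayType(StringType, true)
 :c (g/array-type :double false)  ; DataType style passes through: a non-nullable array
 :d {:x :float :y :double}}       ; map: a nested StructType
```

Such a map can then be handed to Geni wherever a schema is expected. For instance, manual Dataset creation might look like the following sketch, assuming a SparkSession value bound to `spark`:

```clojure
(g/create-dataframe
 spark
 [(g/row 32 "horse") (g/row 64 "mouse")]
 {:number :int :word :str})
```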
1 change: 1 addition & 0 deletions src/clojure/zero_one/geni/core/dataset_creation.clj
@@ -26,6 +26,7 @@
:long DataTypes/LongType
:nil DataTypes/NullType
:short DataTypes/ShortType
+:str DataTypes/StringType
:string DataTypes/StringType
:timestamp DataTypes/TimestampType
:vector (VectorUDT.)
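
With this addition, `:str` joins `:string` as a shorthand for `DataTypes/StringType`, so the two struct fields below should be equivalent:

```clojure
;; Both fields resolve to StringType under the keyword mapping above.
(g/struct-field :word :str true)
(g/struct-field :word :string true)
```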
2 changes: 1 addition & 1 deletion test/zero_one/geni/dataset_creation_test.clj
@@ -31,7 +31,7 @@
[(g/row 32 "horse" (g/dense 1.0 2.0) (g/sparse 4 [1 3] [3.0 4.0]))
(g/row 64 "mouse" (g/dense 3.0 4.0) (g/sparse 4 [0 2] [1.0 2.0]))]
{:number :int
-:word :string
+:word :str
:dense :vector
:sparse :vector}))
=> #(and (= (:number %) "IntegerType")
