Docs for data oriented schema (#290)
* Started doc on schemas

* First draft of schema doc

* Proofread data-oriented schema doc

* Made type-shorthand more consistent
anthony-khong committed Nov 10, 2020
1 parent 80f7835 commit bd71983
Showing 5 changed files with 78 additions and 31 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -31,6 +31,7 @@ Geni provides an idiomatic Spark interface for Clojure without the hassle of Jav
<li><a href="docs/simple_performance_benchmark.md">A Simple Performance Benchmark</a></li>
<li><a href="CODE_OF_CONDUCT.md">Code of Conduct</a></li>
<li><a href="CONTRIBUTING.md">Contributing Guide</a></li>
<li><a href="docs/creating_spark_schemas.md">Creating Spark Schemas</a></li>
<li><a href="docs/examples.md">Examples</a></li>
<li><a href="docs/design_goals.md">Design Goals</a></li>
<li><a href="docs/semantics.md">Geni Semantics</a></li>
@@ -42,7 +43,7 @@ Geni provides an idiomatic Spark interface for Clojure without the hassle of Jav
<li><a href="docs/spark_session.md">Where's The Spark Session</a></li>
<li><a href="docs/why.md">Why?</a></li>
<li><a href="docs/sql_maps.md">Working with SQL Maps</a></li>
<li><a href="docs/collect.md">Collect data into Clojure Repl</a></li>
<li><a href="docs/collect.md">Collecting Data from Spark Datasets</a></li>
</ul>
</td>
<td>
46 changes: 17 additions & 29 deletions docs/collect.md
@@ -1,19 +1,16 @@
-# Collect data to the driver
+# Collecting Data from Spark Datasets


So far we have used Spark functions, which are executed on all nodes of the cluster by the Spark engine. Spark offers many different ways to operate on data, but sometimes we want to manipulate the data differently, in pure Clojure.

-This requires to move the data from the Spark workers into the driver node, on which the Clojure Repl is running. This is only useful and possible, if the data is **small**, and fits on the node.
-
-Geni offers several functions starting with `collect-` which transport the data to the driver and then into the Clojure Repl.

+This requires moving the data from the Spark workers to the driver node, on which the Clojure REPL is running. This is only useful and possible if the data is **small** and fits on the node.
+
+Geni offers several functions starting with `collect-` which transport the data to the driver and then into the Clojure REPL.

## Collect as Clojure data

-A very common case is to access the data as Clojure maps with `collect`
+A very common case is to access the data as Clojure maps with `collect`:

```clojure

@@ -41,28 +38,23 @@ A very common case is to access the data as Clojure maps with `collect`
)
```

-Alternatively we can get the data as a sequence of vectors with `collect-vals`
+Alternatively we can get the data as a sequence of vectors with `collect-vals`:

```clojure

(-> fixed-df (g/limit 2) g/collect-vals)
-=>
-(["01/01/2012" 35 nil 0 38 51 26 10 16 nil]
- ["02/01/2012" 83 nil 1 68 153 53 6 43 nil])
+=> (["01/01/2012" 35 nil 0 38 51 26 10 16 nil]
+    ["02/01/2012" 83 nil 1 68 153 53 6 43 nil])
```

To access the values of a single column, we can use `collect-col`:

```clojure
-(-> fixed-df (g/limit 2) (g/collect-col :Date))
-=>
-("01/01/2012" "02/01/2012")
+(-> fixed-df (g/limit 2) (g/collect-col :Date))
+=> ("01/01/2012" "02/01/2012")
```

-## Collect as arrow files
+## Collect as Arrow files

We can get the data into the driver as Arrow files as well, by using the function `collect-to-arrow`.
This has the advantage that it can work with data larger than the heap space of the driver.
@@ -81,25 +73,21 @@ The function returns a sequence of file names created.

```clojure
(-> fixed-df
-    (g/repartition 20) ;; split data in 20 partitions of equal size
-    (g/collect-to-arrow 100 "/tmp"))
-=>
-["/tmp/geni12331590604347994819.ipc" "/tmp/geni2107925719499812901.ipc" "/tmp/geni2239579196531625282.ipc" "/tmp/geni14530350610103010872.ipc"]
+    (g/repartition 20) ;; split data in 20 partitions of equal size
+    (g/collect-to-arrow 100 "/tmp"))
+=> ["/tmp/geni12331590604347994819.ipc"
+    "/tmp/geni2107925719499812901.ipc"
+    "/tmp/geni2239579196531625282.ipc"
+    "/tmp/geni14530350610103010872.ipc"]
```

Setting the number of partitions and the chunk size small enough should allow the transfer of arbitrarily large data to the driver, but this can obviously become slow if the data is big.

The files are written in the Arrow stream format, which can be processed by other software packages or with the Clojure `tech.ml.dataset` library.
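
As a minimal sketch of that workflow (assuming a recent tech.ml.dataset on the classpath that provides the `tech.v3.libs.arrow` namespace), one of the generated files can be read back as follows:

```clojure
(require '[tech.v3.libs.arrow :as arrow])

;; Read one of the Arrow stream files produced by collect-to-arrow
;; back into a tech.ml.dataset dataset on the driver.
(arrow/stream->dataset "/tmp/geni12331590604347994819.ipc")
```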



## Integration with tech.ml.dataset

The very latest alpha version of [tech.ml.dataset](https://github.com/techascent/tech.ml.dataset) offers a deeper integration with Geni, and allows converting a Spark data frame directly into a tech.ml.dataset dataset. This happens on the driver, so the data needs to fit in heap space.

-See (here)[https://github.com/techascent/tech.ml.dataset/blob/43f411d224a50057ae7d8817d89eda3611b33115/src/tech/v3/libs/spark.clj#L191]
-
-for details.
+See [here](https://github.com/techascent/tech.ml.dataset/blob/43f411d224a50057ae7d8817d89eda3611b33115/src/tech/v3/libs/spark.clj#L191) for details.
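
A hypothetical sketch of that direct conversion (the function name `spark-dataset->dataset` is illustrative only; the actual API is defined in the linked `spark.clj`):

```clojure
(require '[tech.v3.libs.spark :as tech-spark])

;; Hypothetical function name; see the linked spark.clj for the real API.
(def tmd-ds (tech-spark/spark-dataset->dataset fixed-df))
```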
57 changes: 57 additions & 0 deletions docs/creating_spark_schemas.md
@@ -0,0 +1,57 @@
# Creating Spark Schemas

Schema creation is typically required for [manual Dataset creation](manual_dataset_creation.md) and for having more control when loading a Dataset from a file.

One way to create a Spark schema is to use the Geni API that closely mimics the original Scala Spark API using **Spark DataTypes**. That is, the following Scala version:

```scala
StructType(Array(
  StructField("a", IntegerType, true),
  StructField("b", StringType, true),
  StructField("c", ArrayType(ShortType, true), true),
  StructField("d", MapType(StringType, IntegerType, true), true),
  StructField(
    "e",
    StructType(Array(
      StructField("x", FloatType, true),
      StructField("y", DoubleType, true)
    )),
    true
  )
))
```

gets translated into:

```clojure
(g/struct-type
 (g/struct-field :a :int true)
 (g/struct-field :b :str true)
 (g/struct-field :c (g/array-type :short true) true)
 (g/struct-field :d (g/map-type :str :int) true)
 (g/struct-field :e
                 (g/struct-type
                  (g/struct-field :x :float true)
                  (g/struct-field :y :double true))
                 true))
```

Whilst the Clojure version may look cleaner than the original Scala version, Geni offers an even more concise way to specify complex schemas such as the example above and cut through the boilerplate. In particular, we can use Geni's **data-oriented schemas**:

```clojure
{:a :int
 :b :str
 :c [:short]
 :d [:str :int]
 :e {:x :float :y :double}}
```

The conversion rules are simple:

* all fields and types default to nullable;
* a vector of count one is interpreted as an `ArrayType`;
* a vector of count two is interpreted as a `MapType`;
* a map is interpreted as a nested `StructType`; and
* everything else is left as is.

In particular, the last rule allows us to mix and match the data-oriented style with the Spark DataType style for specifying nested types.
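
As a minimal sketch of that last rule (reusing the `g/` alias from the examples above), the following schema mixes the data-oriented shorthands with an explicit Spark DataType call:

```clojure
{:a :int                          ; plain keyword: left as is
 :b [:str]                        ; one-element vector: ArrayType(StringType, true)
 :c (g/array-type :double false)  ; DataType style passes through: a non-nullable array
 :d {:x :float :y :double}}       ; map: a nested StructType
```

Such a map can then be handed to Geni wherever a schema is expected. For instance, manual Dataset creation might look like the following sketch, assuming a SparkSession value bound to `spark`:

```clojure
(g/create-dataframe
 spark
 [(g/row 32 "horse") (g/row 64 "mouse")]
 {:number :int :word :str})
```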
1 change: 1 addition & 0 deletions src/clojure/zero_one/geni/core/dataset_creation.clj
@@ -26,6 +26,7 @@
:long DataTypes/LongType
:nil DataTypes/NullType
:short DataTypes/ShortType
+:str DataTypes/StringType
:string DataTypes/StringType
:timestamp DataTypes/TimestampType
:vector (VectorUDT.)
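
With this addition, `:str` joins `:string` as a shorthand for `DataTypes/StringType`, so the two struct fields below should be equivalent:

```clojure
;; Both fields resolve to StringType under the keyword mapping above.
(g/struct-field :word :str true)
(g/struct-field :word :string true)
```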
2 changes: 1 addition & 1 deletion test/zero_one/geni/dataset_creation_test.clj
@@ -31,7 +31,7 @@
[(g/row 32 "horse" (g/dense 1.0 2.0) (g/sparse 4 [1 3] [3.0 4.0]))
(g/row 64 "mouse" (g/dense 3.0 4.0) (g/sparse 4 [0 2] [1.0 2.0]))]
{:number :int
-:word :string
+:word :str
:dense :vector
:sparse :vector}))
=> #(and (= (:number %) "IntegerType")
