pyspark.sql.functions.lag

pyspark.sql.functions.lag(col, offset=1, default=None)

Window function: returns the value that is offset rows before the current row, or default if there are fewer than offset rows before the current row. For example, an offset of one returns the value from the previous row at any given point in the window partition.

This is equivalent to the LAG function in SQL.

New in version 1.4.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
col : Column or str

name of column or expression

offset : int, optional, default 1

number of rows to look back from the current row

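default : optional

value to return when there are fewer than offset rows before the current row; None (shown as NULL) if not given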

Returns
Column

the value offset rows before the current row, or default if no such row exists.

Examples

>>> from pyspark.sql import Window
>>> from pyspark.sql.functions import lag
>>> df = spark.createDataFrame([("a", 1),
...                             ("a", 2),
...                             ("a", 3),
...                             ("b", 8),
...                             ("b", 2)], ["c1", "c2"])
>>> df.show()
+---+---+
| c1| c2|
+---+---+
|  a|  1|
|  a|  2|
|  a|  3|
|  b|  8|
|  b|  2|
+---+---+
>>> w = Window.partitionBy("c1").orderBy("c2")
>>> df.withColumn("previous_value", lag("c2").over(w)).show()
+---+---+--------------+
| c1| c2|previous_value|
+---+---+--------------+
|  a|  1|          NULL|
|  a|  2|             1|
|  a|  3|             2|
|  b|  2|          NULL|
|  b|  8|             2|
+---+---+--------------+
>>> df.withColumn("previous_value", lag("c2", 1, 0).over(w)).show()
+---+---+--------------+
| c1| c2|previous_value|
+---+---+--------------+
|  a|  1|             0|
|  a|  2|             1|
|  a|  3|             2|
|  b|  2|             0|
|  b|  8|             2|
+---+---+--------------+
>>> df.withColumn("previous_value", lag("c2", 2, -1).over(w)).show()
+---+---+--------------+
| c1| c2|previous_value|
+---+---+--------------+
|  a|  1|            -1|
|  a|  2|            -1|
|  a|  3|             1|
|  b|  2|            -1|
|  b|  8|            -1|
+---+---+--------------+
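
The results above can be reasoned about with a small plain-Python model of the semantics: group rows by the partition column, sort each group by the ordering column, then look back offset positions and fall back to default when no such row exists. This is only an illustrative sketch and not how Spark evaluates window functions; the helper name lag_rows, its parameters, and the output key "lagged" are hypothetical and are not part of the PySpark API.

```python
from collections import defaultdict


def lag_rows(rows, partition_key, order_key, value_key, offset=1, default=None):
    """Hypothetical model of lag(value_key, offset, default) over a window
    partitioned by partition_key and ordered by order_key."""
    # Group rows by partition, as Window.partitionBy would.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    result = []
    for key in sorted(partitions):
        # Order rows within the partition, as Window.orderBy would.
        ordered = sorted(partitions[key], key=lambda r: r[order_key])
        for i, row in enumerate(ordered):
            j = i - offset  # position of the row `offset` rows earlier
            in_range = 0 <= j < len(ordered)
            lagged = ordered[j][value_key] if in_range else default
            result.append({**row, "lagged": lagged})
    return result


# Same data as the DataFrame in the examples above.
rows = [
    {"c1": "a", "c2": 1},
    {"c1": "a", "c2": 2},
    {"c1": "a", "c2": 3},
    {"c1": "b", "c2": 8},
    {"c1": "b", "c2": 2},
]

# Mirrors lag("c2", 2, -1) over a window partitioned by c1 and ordered by c2.
lagged_values = [r["lagged"] for r in lag_rows(rows, "c1", "c2", "c2", offset=2, default=-1)]
print(lagged_values)
```

Under this model, only the third row of partition "a" has a row two positions before it, so every other row receives the default of -1, which agrees with the last table above.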