Support UPDATE output mode for Spark Structured Streaming #1123

Closed · jbaiera opened this issue Mar 17, 2018 · 16 comments · Fixed by #1839

Comments

jbaiera (Member) commented Mar 17, 2018

We now support Spark Structured Streaming as of 6.x, but only in APPEND mode. It would be beneficial to support UPDATE mode, with the requirement that the user MUST provide a field to be used for a document ID.

kant111 commented Mar 19, 2018

Can we make this backward compatible as well? That way users could use this feature with ES >= 5.5 or something similar.

jbaiera (Member, Author) commented Mar 19, 2018

While the project's official stance is that we maintain BWC with the most recent minor release of the previous major version, we make a best effort to maintain backwards compatibility with as many versions of Elasticsearch as we can.

kant111 commented Mar 20, 2018

Any sort of timeline or release number on this?

jbaiera (Member, Author) commented Mar 20, 2018

We usually do not commit to timelines or release targets on public issues.

kant111 commented May 1, 2018

@jbaiera Can I use update mode with a Foreach sink to write to ES for now?

jbaiera (Member, Author) commented May 2, 2018

That seems like a reasonable workaround in the meantime.
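
As a rough illustration (a minimal sketch, not connector code), the wiring would look something like the following, assuming df is a streaming Dataset<Row> and writer is any ForeachWriter<Row> that writes to Elasticsearch, such as the one posted further down in this thread:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;

public class ForeachUpdateExample {
    static StreamingQuery startForeachQuery(Dataset<Row> df, ForeachWriter<Row> writer) throws Exception {
        return df.writeStream()
                .outputMode(OutputMode.Update())  // the foreach sink accepts Update output mode
                .foreach(writer)                  // e.g. an Elasticsearch-backed ForeachWriter
                .start();
    }
}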

kant111 commented May 2, 2018

@jbaiera Thanks. Can you please provide some more details on how to do that? For example, should I use the elasticsearch-hadoop connector or the Elasticsearch Java client? Are there any examples? I don't see anything in the docs on how to do this.

I am trying to write a streaming dataset to Elasticsearch in update mode.

I am currently using Spark 2.3.0 and ES 5.3.2.

jbaiera (Member, Author) commented May 2, 2018

This very much depends on your use case. In my experience, Spark does not offer much in the way of solid documentation for writing your own sinks. Much of the current implementation in ES-Hadoop was the product of a month of trial and error and reverse engineering existing sink implementations. If you don't care much about the expectations around transaction acknowledgement and skipping acked transactions during a recovery, then the basic Foreach sink should be usable for your case.

I understand that the current Java REST client can sometimes cause issues in Hadoop/Spark environments because of clashing dependencies; ES-Hadoop has its own REST client implementation to get around this. I don't recommend building on the ES-Hadoop internals, as they are not covered by our semantic versioning scheme and may break between any two releases with no notice. Because of this, I recommend using the available Elasticsearch clients if you are going to build something as a workaround.

kant111 commented May 2, 2018

@jbaiera 1) I don't see any REST client for Elasticsearch 5.3.2. 2) When I used the BulkProcessor from the transport Java client to write to ES, I got a task not serializable exception.

Here is the code:

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.transport.client.PreBuiltTransportClient;
import scala.collection.JavaConverters;
import scala.collection.Seq;

public class EsSink extends ForeachWriter<Row> {

    // client and bulk processor are created eagerly in the constructor (i.e. on the driver)
    private TransportClient client;
    private BulkProcessor bulkProcessor;

    public EsSink(String cluster, String host, int port) throws UnknownHostException {
        Settings settings = Settings.builder()
                .put("cluster.name", cluster).build();
        String[] elasticSearchIps = host.split(",");
        InetSocketTransportAddress[] inetSocketTransportAddresses = new InetSocketTransportAddress[elasticSearchIps.length];
        for (int i = 0; i < elasticSearchIps.length; i++) {
            inetSocketTransportAddresses[i] = new InetSocketTransportAddress(InetAddress.getByName(elasticSearchIps[i]), port);
        }
        this.client = new PreBuiltTransportClient(settings)
                .addTransportAddresses(inetSocketTransportAddresses);

        this.bulkProcessor = BulkProcessor.builder(
                client,
                new BulkProcessor.Listener() {
                    @Override
                    public void beforeBulk(long executionId,
                                           BulkRequest request) {}

                    @Override
                    public void afterBulk(long executionId,
                                          BulkRequest request,
                                          BulkResponse response) {}

                    @Override
                    public void afterBulk(long executionId,
                                          BulkRequest request,
                                          Throwable failure) {}
                })
                .setBulkActions(10000)
                .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))
                .setFlushInterval(TimeValue.timeValueSeconds(5))
                .setConcurrentRequests(1)
                .setBackoffPolicy(
                        BackoffPolicy.exponentialBackoff(TimeValue.timeValueMillis(100), 3))
                .build();
    }

    @Override
    public boolean open(long partitionId, long version) {
        return true;
    }

    @Override
    public void process(Row row) {
        String[] fieldNames = row.schema().fieldNames();
        Seq<String> fieldNamesSeq = JavaConverters.asScalaIteratorConverter(Arrays.asList(fieldNames).iterator()).asScala().toSeq();
        // caution: Scala Map#toString does not produce valid JSON; a proper JSON serializer is needed here
        String jsonDocument = row.getValuesMap(fieldNamesSeq).toString();
        IndexRequest indexRequest = Requests.indexRequest("hello")
                .type("foo")
                .id(row.getAs("id").toString())  // look the id value up by field name; Row.get() takes an index, not a name
                .source(jsonDocument, XContentType.JSON);
        this.bulkProcessor.add(indexRequest);
    }

    @Override
    public void close(Throwable throwable) {
        this.bulkProcessor.close();
        this.client.close();
    }
}

jbaiera (Member, Author) commented May 3, 2018

If you are using the transport client, you will need to construct the client objects lazily and mark them as transient, as they cannot be serialized from the driver to the cluster.
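
To make that concrete, here is a minimal sketch (assuming the same Elasticsearch 5.x transport client as above): only plain configuration strings are serialized with the writer, and the client and bulk processor are rebuilt per partition in open():

import java.net.InetAddress;
import java.net.UnknownHostException;

import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class LazyEsSink extends ForeachWriter<Row> {

    // only simple, serializable configuration travels from the driver to the executors
    private final String cluster;
    private final String hosts;
    private final int port;

    // transient: never serialized; rebuilt on each executor in open()
    private transient TransportClient client;
    private transient BulkProcessor bulkProcessor;

    public LazyEsSink(String cluster, String hosts, int port) {
        this.cluster = cluster;
        this.hosts = hosts;
        this.port = port;
    }

    @Override
    public boolean open(long partitionId, long version) {
        try {
            Settings settings = Settings.builder().put("cluster.name", cluster).build();
            client = new PreBuiltTransportClient(settings);
            for (String host : hosts.split(",")) {
                client.addTransportAddress(
                        new InetSocketTransportAddress(InetAddress.getByName(host), port));
            }
            bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
                @Override public void beforeBulk(long executionId, BulkRequest request) {}
                @Override public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {}
                @Override public void afterBulk(long executionId, BulkRequest request, Throwable failure) {}
            }).build();
            return true;
        } catch (UnknownHostException e) {
            throw new RuntimeException("Could not resolve Elasticsearch host", e);
        }
    }

    @Override
    public void process(Row row) {
        // same per-row indexing logic as in the EsSink above, ending in bulkProcessor.add(...)
    }

    @Override
    public void close(Throwable errorOrNull) {
        if (bulkProcessor != null) bulkProcessor.close();
        if (client != null) client.close();
    }
}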

kant111 commented May 19, 2018

OK, got it. By the way, I haven't implemented a custom sink before, but I looked at the ES-Hadoop connector code to see how hard it would be to implement update mode. The append mode code already establishes a good structure in terms of implementing the right interfaces. So my question now is: what changes, or how big of a change, would I need to make to implement update mode?

kant111 commented May 20, 2018

The interface below is already implemented in the ES-Hadoop connector, and output modes are handled by Spark; in other words, Spark sends whatever output mode the user sets. It looks to me like we just need to change one if statement, but that feels too good to be true?

trait Sink {
  def addBatch(batchId: Long, data: DataFrame): Unit
}

jbaiera (Member, Author) commented May 21, 2018

@kant111 I'm happy to see that you've taken an interest in the code. That if statement does indeed block the option from being used, but it doesn't do anything to make sure that the Sink implementation observes the invariants set forth by UPDATE mode, or follows what a user might expect from UPDATE mode with regard to Elasticsearch. The following things would need to be in place for us to accept working with UPDATE mode:

First, in UPDATE mode, the underlying connector capabilities should be fine as long as the connector uses the upsert method of ingestion to ES. Alongside this mode, a field must be marked as the document ID in the configuration. To support UPDATE mode, we should check that these settings are present, or even set them in the connector at that point, throwing an error if they already have incompatible values.

Second, for us to support UPDATE mode, we need sufficient testing to make sure that the connector is performing the required operations and sees the expected outcomes. Building these tests can be time consuming, which is why they were left out of the initial implementation.

Finally, there is no plan to support any output modes other than APPEND and (eventually) UPDATE. The COMPLETE output mode (and any future ones) would need to be blocked off, since Elasticsearch would not support that kind of output.
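
For reference, here is a hypothetical sketch of what a streaming write in UPDATE mode could look like once supported, based on the description above. The es.write.operation and es.mapping.id keys are existing ES-Hadoop settings, while the index/type, checkpoint path, and df (an assumed streaming Dataset<Row> with an id column) are placeholders; whether the sink requires these options or sets them itself is exactly the open design question.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;

public class UpdateModeExample {
    static StreamingQuery startUpdateQuery(Dataset<Row> df) throws Exception {
        return df.writeStream()
                .outputMode(OutputMode.Update())                 // currently rejected; APPEND only
                .format("es")                                    // ES-Hadoop structured streaming sink
                .option("checkpointLocation", "/tmp/es-update")  // placeholder path
                .option("es.mapping.id", "id")                   // field used as the document _id
                .option("es.write.operation", "upsert")          // update-or-insert per document
                .start("hello/foo");                             // placeholder index/type
    }
}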

soyme commented Nov 16, 2018

I have been waiting for UPDATE output mode support... 😢 Do you have any release schedule?

toddleo commented May 27, 2019

Bump. Is there any progress here or timetable after a long wait? 😢

masseyke (Member) commented
If anyone on this ticket is still interested (I know it's been a long wait), I've got a draft PR up for this. I'd appreciate any feedback before we finalize it. See #1839

masseyke added a commit that referenced this issue Jan 20, 2022
This commit adds support for "update" as the output mode for Spark Structured Streaming to Elasticsearch.
Closes #1123