Commit bb200ab

Manual testing of Morpheus with Kafka & Validation improvements (#290)
* Adds instructions for manually testing Morpheus using Kafka. Adds a Kafka version of each of the four validation scripts in `scripts/validation`.
* The CSV and JSON serializers now support an `include_index_col` flag to control exporting the DataFrame's index column. Note that due to a limitation of cudf and pandas this has no impact on JSON:
  + pandas-dev/pandas#37600
  + rapidsai/cudf#11317
* `morpheus.utils.logging` renamed to `morpheus.utils.logger` so that other modules in `morpheus.utils` can import the standard library `logging` module.
* Comparison logic in the `ValidationStage` has been moved to its own module, `morpheus.utils.compare_df`, so that the functionality can be used outside of the stage.

Fixes #265

Authors:
- David Gardner (https://github.com/dagardner-nv)

Approvers:
- Pete MacKinnon (https://github.com/pdmack)
- Michael Demoret (https://github.com/mdemoret-nv)

URL: #290
1 parent a47cc65 commit bb200ab

24 files changed

+806
-143
lines changed


CONTRIBUTING.md

Lines changed: 21 additions & 11 deletions
Original file line number | Diff line number | Diff line change
@@ -232,17 +232,24 @@ Launching a full production Kafka cluster is outside the scope of this project.
232232
```bash
233233
export KAFKA_ADVERTISED_HOST_NAME=$(docker network inspect bridge | jq -r '.[0].IPAM.Config[0].Gateway')
234234
```
235-
5. Update the `kafka-docker/docker-compose.yml` so the environment variable `KAFKA_ADVERTISED_HOST_NAME` matches the previous step. For example, the line should look like:
235+
5. Update the `kafka-docker/docker-compose.yml`, performing two changes:
236+
1. Update the `ports` entry to:
237+
```yml
238+
ports:
239+
- "0.0.0.0::9092"
240+
```
241+
This will prevent the containers from attempting to map IPv6 ports.
242+
1. Change the value of `KAFKA_ADVERTISED_HOST_NAME` to match the value of the `KAFKA_ADVERTISED_HOST_NAME` environment variable from the previous step. For example, the line should look like:
236243

237-
```yml
238-
environment:
239-
KAFKA_ADVERTISED_HOST_NAME: 172.17.0.1
240-
```
241-
Which should match the value of `$KAFKA_ADVERTISED_HOST_NAME` from the previous step:
244+
```yml
245+
environment:
246+
KAFKA_ADVERTISED_HOST_NAME: 172.17.0.1
247+
```
248+
Which should match the value of `$KAFKA_ADVERTISED_HOST_NAME` from the previous step:
242249

243-
```bash
244-
$ echo $KAFKA_ADVERTISED_HOST_NAME
245-
"172.17.0.1"
250+
```bash
251+
$ echo $KAFKA_ADVERTISED_HOST_NAME
252+
"172.17.0.1"
246253
```
247254
6. Launch kafka with 3 instances:
248255

@@ -252,11 +259,14 @@ Launching a full production Kafka cluster is outside the scope of this project.
252259
In practice, 3 instances has been shown to work well. Use as many instances as required. Keep in mind each instance takes about 1 Gb of memory.
253260
7. Launch the Kafka shell
254261
1. To configure the cluster, you will need to launch into a container that has the Kafka shell.
255-
2. You can do this with `./start-kafka-shell.sh $KAFKA_ADVERTISED_HOST_NAME`.
262+
2. You can do this with:
263+
```bash
264+
./start-kafka-shell.sh $KAFKA_ADVERTISED_HOST_NAME
265+
```
256266
3. However, this makes it difficult to load data into the cluster. Instead, you can manually launch the Kafka shell by running:
257267
```bash
258268
# Change to the morpheus root to make it easier for mounting volumes
259-
cd ${MORPHEUS_HOME}
269+
cd ${MORPHEUS_ROOT}
260270
261271
# Run the Kafka shell docker container
262272
docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock \
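
The `jq` filter used earlier in this file's diff (`.[0].IPAM.Config[0].Gateway`) simply walks the JSON printed by `docker network inspect bridge`. As a rough illustration, the same lookup can be sketched in Python. The function name `gateway_from_inspect` and the sample payload are assumptions made for this example; they are not part of the commit.

```python
import json


def gateway_from_inspect(inspect_output: str) -> str:
    """Return the gateway of the first IPAM config of the first network.

    Mirrors the jq filter `.[0].IPAM.Config[0].Gateway`.
    """
    networks = json.loads(inspect_output)
    return networks[0]["IPAM"]["Config"][0]["Gateway"]


# Hypothetical, trimmed-down stand-in for `docker network inspect bridge` output.
sample = json.dumps([
    {
        "Name": "bridge",
        "IPAM": {"Config": [{"Subnet": "172.17.0.0/16", "Gateway": "172.17.0.1"}]},
    }
])

gateway = gateway_from_inspect(sample)
```

The resulting value is what `KAFKA_ADVERTISED_HOST_NAME` is expected to hold, both in the shell and in `kafka-docker/docker-compose.yml`.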

docs/source/developer_guide/guides/1_simple_python_stage.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -124,7 +124,7 @@ from morpheus.config import Config
124124
from morpheus.pipeline import LinearPipeline
125125
from morpheus.stages.general.monitor_stage import MonitorStage
126126
from morpheus.stages.input.file_source_stage import FileSourceStage
127-
from morpheus.utils.logging import configure_logging
127+
from morpheus.utils.logger import configure_logging
128128

129129
from pass_thru import PassThruStage
130130
```
@@ -185,7 +185,7 @@ from morpheus.config import Config
185185
from morpheus.pipeline import LinearPipeline
186186
from morpheus.stages.general.monitor_stage import MonitorStage
187187
from morpheus.stages.input.file_source_stage import FileSourceStage
188-
from morpheus.utils.logging import configure_logging
188+
from morpheus.utils.logger import configure_logging
189189

190190
from pass_thru import PassThruStage
191191

docs/source/developer_guide/guides/2_real_world_phishing.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -302,7 +302,7 @@ from morpheus.stages.postprocess.filter_detections_stage import FilterDetections
302302
from morpheus.stages.postprocess.serialize_stage import SerializeStage
303303
from morpheus.stages.preprocess.deserialize_stage import DeserializeStage
304304
from morpheus.stages.preprocess.preprocess_nlp_stage import PreprocessNLPStage
305-
from morpheus.utils.logging import configure_logging
305+
from morpheus.utils.logger import configure_logging
306306

307307
from recipient_features_stage import RecipientFeaturesStage
308308

examples/abp_pcap_detection/run.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -31,7 +31,7 @@
3131
from morpheus.stages.postprocess.add_classifications_stage import AddClassificationsStage
3232
from morpheus.stages.postprocess.serialize_stage import SerializeStage
3333
from morpheus.stages.preprocess.deserialize_stage import DeserializeStage
34-
from morpheus.utils.logging import configure_logging
34+
from morpheus.utils.logger import configure_logging
3535

3636

3737
@click.command()

examples/gnn_fraud_detection_pipeline/run.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -26,7 +26,7 @@
2626
from morpheus.stages.output.write_to_file_stage import WriteToFileStage
2727
from morpheus.stages.postprocess.serialize_stage import SerializeStage
2828
from morpheus.stages.preprocess.deserialize_stage import DeserializeStage
29-
from morpheus.utils.logging import configure_logging
29+
from morpheus.utils.logger import configure_logging
3030
from stages.classification_stage import ClassificationStage
3131
from stages.graph_construction_stage import FraudGraphConstructionStage
3232
from stages.graph_sage_stage import GraphSAGEStage

examples/ransomware_detection/run.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -29,7 +29,7 @@
2929
from morpheus.stages.output.write_to_file_stage import WriteToFileStage
3030
from morpheus.stages.postprocess.add_scores_stage import AddScoresStage
3131
from morpheus.stages.postprocess.serialize_stage import SerializeStage
32-
from morpheus.utils.logging import configure_logging
32+
from morpheus.utils.logger import configure_logging
3333
from stages.create_features import CreateFeaturesRWStage
3434
from stages.preprocessing import PreprocessingRWStage
3535

morpheus/_lib/include/morpheus/io/serializers.hpp

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -25,12 +25,14 @@
2525

2626
namespace morpheus {
2727

28-
std::string df_to_csv(const TableInfo& tbl, bool include_header);
28+
std::string df_to_csv(const TableInfo& tbl, bool include_header, bool include_index_col = true);
2929

30-
void df_to_csv(const TableInfo& tbl, std::ostream& out_stream, bool include_header);
30+
void df_to_csv(const TableInfo& tbl, std::ostream& out_stream, bool include_header, bool include_index_col = true);
3131

32-
std::string df_to_json(const TableInfo& tbl);
32+
// Note the include_index_col is currently being ignored in both versions of `df_to_json` due to a known issue in
33+
// Pandas: https://github.com/pandas-dev/pandas/issues/37600
34+
std::string df_to_json(const TableInfo& tbl, bool include_index_col = true);
3335

34-
void df_to_json(const TableInfo& tbl, std::ostream& out_stream);
36+
void df_to_json(const TableInfo& tbl, std::ostream& out_stream, bool include_index_col = true);
3537

3638
} // namespace morpheus

morpheus/_lib/include/morpheus/stages/write_to_file.hpp

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -49,7 +49,8 @@ class WriteToFileStage : public srf::pysrf::PythonNode<std::shared_ptr<MessageMe
4949
*/
5050
WriteToFileStage(const std::string &filename,
5151
std::ios::openmode mode = std::ios::out,
52-
FileTypes file_type = FileTypes::Auto);
52+
FileTypes file_type = FileTypes::Auto,
53+
bool include_index_col = true);
5354

5455
private:
5556
/**
@@ -64,6 +65,7 @@ class WriteToFileStage : public srf::pysrf::PythonNode<std::shared_ptr<MessageMe
6465
subscribe_fn_t build_operator();
6566

6667
bool m_is_first;
68+
bool m_include_index_col;
6769
std::ofstream m_fstream;
6870
std::function<void(sink_type_t &)> m_write_func;
6971
};
@@ -81,7 +83,8 @@ struct WriteToFileStageInterfaceProxy
8183
const std::string &name,
8284
const std::string &filename,
8385
const std::string &mode = "w",
84-
FileTypes file_type = FileTypes::Auto);
86+
FileTypes file_type = FileTypes::Auto,
87+
bool include_index_col = true);
8588
};
8689

8790
#pragma GCC visibility pop

morpheus/_lib/src/io/serializers.cpp

Lines changed: 24 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -19,11 +19,15 @@
1919

2020
#include <cudf/io/csv.hpp>
2121
#include <cudf/io/data_sink.hpp>
22+
#include <cudf/table/table_view.hpp>
23+
#include <cudf/types.hpp>
2224
#include <pybind11/gil.h>
2325
#include <pybind11/pybind11.h>
2426
#include <pybind11/pytypes.h>
2527
#include <rmm/mr/device/per_device_resource.hpp>
2628

29+
#include <memory>
30+
#include <numeric>
2731
#include <ostream>
2832
#include <sstream>
2933

@@ -75,38 +79,48 @@ class OStreamSink : public cudf::io::data_sink
7579
size_t m_bytest_written{0};
7680
};
7781

78-
std::string df_to_csv(const TableInfo& tbl, bool include_header)
82+
std::string df_to_csv(const TableInfo& tbl, bool include_header, bool include_index_col)
7983
{
8084
// Create an ostringstream and use that with the overload accepting an ostream
8185
std::ostringstream out_stream;
8286

83-
df_to_csv(tbl, out_stream, include_header);
87+
df_to_csv(tbl, out_stream, include_header, include_index_col);
8488

8589
return out_stream.str();
8690
}
8791

88-
void df_to_csv(const TableInfo& tbl, std::ostream& out_stream, bool include_header)
92+
void df_to_csv(const TableInfo& tbl, std::ostream& out_stream, bool include_header, bool include_index_col)
8993
{
94+
auto column_names = tbl.get_column_names();
95+
cudf::size_type start_col = 1;
96+
if (include_index_col)
97+
{
98+
start_col = 0;
99+
column_names.insert(column_names.begin(), ""s); // insert the id column
100+
}
101+
102+
std::vector<cudf::size_type> col_idexes(column_names.size());
103+
std::iota(col_idexes.begin(), col_idexes.end(), start_col);
104+
auto tbl_view = tbl.get_view().select(col_idexes);
105+
90106
OStreamSink sink(out_stream);
91107
auto destination = cudf::io::sink_info(&sink);
92-
auto options_builder = cudf::io::csv_writer_options_builder(destination, tbl.get_view())
108+
auto options_builder = cudf::io::csv_writer_options_builder(destination, tbl_view)
93109
.include_header(include_header)
94110
.true_value("True"s)
95111
.false_value("False"s);
96112

97113
cudf::io::table_metadata metadata{};
98114
if (include_header)
99115
{
100-
auto column_names = tbl.get_column_names();
101-
column_names.insert(column_names.begin(), ""s); // insert the id column
102116
metadata.column_names = column_names;
103117
options_builder = options_builder.metadata(&metadata);
104118
}
105119

106120
cudf::io::write_csv(options_builder.build(), rmm::mr::get_current_device_resource());
107121
}
108122

109-
std::string df_to_json(const TableInfo& tbl)
123+
std::string df_to_json(const TableInfo& tbl, bool include_index_col)
110124
{
111125
std::string results;
112126
// no cpp impl for to_json, instead python module converts to pandas and calls to_json
@@ -116,7 +130,7 @@ std::string df_to_json(const TableInfo& tbl)
116130

117131
auto df = tbl.as_py_object();
118132
auto buffer = StringIO();
119-
py::dict kwargs = py::dict("orient"_a = "records", "lines"_a = true);
133+
py::dict kwargs = py::dict("orient"_a = "records", "lines"_a = true, "index"_a = include_index_col);
120134
df.attr("to_json")(buffer, **kwargs);
121135
buffer.attr("seek")(0);
122136

@@ -127,11 +141,11 @@ std::string df_to_json(const TableInfo& tbl)
127141
return results;
128142
}
129143

130-
void df_to_json(const TableInfo& tbl, std::ostream& out_stream)
144+
void df_to_json(const TableInfo& tbl, std::ostream& out_stream, bool include_index_col)
131145
{
132146
// Unlike df_to_csv, we use the ostream overload to call the string overload because there is no C++ implementation
133147
// of to_json
134-
std::string output = df_to_json(tbl);
148+
std::string output = df_to_json(tbl, include_index_col);
135149

136150
// Now write the contents to the stream
137151
out_stream.write(output.data(), output.size());
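
The column selection added to the `std::ostream` overload of `df_to_csv` above can be hard to follow in diff form. The sketch below restates it in Python under the assumption, implied by the diff, that column 0 of the table view is the index and the data columns follow it. The name `select_columns` is hypothetical, and plain lists stand in for the cudf table view.

```python
from typing import List, Tuple


def select_columns(column_names: List[str], include_index_col: bool = True) -> Tuple[List[str], List[int]]:
    """Return the header names and the column indexes to write.

    `column_names` holds only the data columns; the index column is unnamed.
    """
    names = list(column_names)
    start_col = 1  # skip column 0 (the index) by default
    if include_index_col:
        start_col = 0
        names.insert(0, "")  # the index column gets an empty header
    # Equivalent of std::iota over a vector sized to `names`
    col_indexes = list(range(start_col, start_col + len(names)))
    return names, col_indexes


with_index = select_columns(["a", "b"], include_index_col=True)
without_index = select_columns(["a", "b"], include_index_col=False)
```

With the index included, the header gains one empty name and the indexes start at 0. Without it, the header is unchanged and the indexes start at 1, so the index column is never selected.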

morpheus/_lib/src/python_modules/stages.cpp

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -168,8 +168,9 @@ PYBIND11_MODULE(stages, m)
168168
py::arg("builder"),
169169
py::arg("name"),
170170
py::arg("filename"),
171-
py::arg("mode") = "w",
172-
py::arg("file_type") = 0); // Setting this to FileTypes::AUTO throws a conversion error at runtime
171+
py::arg("mode") = "w",
172+
py::arg("file_type") = 0, // Setting this to FileTypes::AUTO throws a conversion error at runtime
173+
py::arg("include_index_col") = true);
173174

174175
#ifdef VERSION_INFO
175176
m.attr("__version__") = MACRO_STRINGIFY(VERSION_INFO);

morpheus/_lib/src/stages/write_to_file.cpp

Lines changed: 11 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -27,9 +27,13 @@
2727
namespace morpheus {
2828
// Component public implementations
2929
// ************ WriteToFileStage **************************** //
30-
WriteToFileStage::WriteToFileStage(const std::string &filename, std::ios::openmode mode, FileTypes file_type) :
30+
WriteToFileStage::WriteToFileStage(const std::string &filename,
31+
std::ios::openmode mode,
32+
FileTypes file_type,
33+
bool include_index_col) :
3134
PythonNode(base_t::op_factory_from_sub_fn(build_operator())),
32-
m_is_first(true)
35+
m_is_first(true),
36+
m_include_index_col(include_index_col)
3337
{
3438
if (file_type == FileTypes::Auto)
3539
{
@@ -59,13 +63,13 @@ WriteToFileStage::WriteToFileStage(const std::string &filename, std::ios::openmo
5963
void WriteToFileStage::write_json(WriteToFileStage::sink_type_t &msg)
6064
{
6165
// Call df_to_json passing our fstream
62-
df_to_json(msg->get_info(), m_fstream);
66+
df_to_json(msg->get_info(), m_fstream, m_include_index_col);
6367
}
6468

6569
void WriteToFileStage::write_csv(WriteToFileStage::sink_type_t &msg)
6670
{
6771
// Call df_to_csv passing our fstream
68-
df_to_csv(msg->get_info(), m_fstream, m_is_first);
72+
df_to_csv(msg->get_info(), m_fstream, m_is_first, m_include_index_col);
6973
}
7074

7175
void WriteToFileStage::close()
@@ -102,7 +106,8 @@ std::shared_ptr<srf::segment::Object<WriteToFileStage>> WriteToFileStageInterfac
102106
const std::string &name,
103107
const std::string &filename,
104108
const std::string &mode,
105-
FileTypes file_type)
109+
FileTypes file_type,
110+
bool include_index_col)
106111
{
107112
std::ios::openmode fsmode = std::ios::out;
108113

@@ -138,7 +143,7 @@ std::shared_ptr<srf::segment::Object<WriteToFileStage>> WriteToFileStageInterfac
138143
throw std::runtime_error(std::string("Unsupported file mode. Must choose either 'w' or 'a'. Mode: ") + mode);
139144
}
140145

141-
auto stage = builder.construct_object<WriteToFileStage>(name, filename, fsmode, file_type);
146+
auto stage = builder.construct_object<WriteToFileStage>(name, filename, fsmode, file_type, include_index_col);
142147

143148
return stage;
144149
}

morpheus/cli.py

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -30,7 +30,7 @@
3030
from morpheus.config import CppConfig
3131
from morpheus.config import PipelineModes
3232
from morpheus.config import auto_determine_bootstrap
33-
from morpheus.utils.logging import configure_logging
33+
from morpheus.utils.logger import configure_logging
3434

3535
# pylint: disable=line-too-long, import-outside-toplevel, invalid-name, global-at-module-level, unused-argument
3636

@@ -1325,6 +1325,12 @@ def validate(ctx: click.Context, **kwargs):
13251325
@click.command(short_help="Write all messages to a file", **command_kwargs)
13261326
@click.option('--filename', type=click.Path(writable=True), required=True, help="The file to write to")
13271327
@click.option('--overwrite', is_flag=True, help="Whether or not to overwrite the target file")
1328+
@click.option('--include-index-col',
1329+
'include_index_col',
1330+
default=True,
1331+
type=bool,
1332+
help=("Includes dataframe's index column in the output "
1333+
"Note: this currently only works for CSV file output"))
13281334
@prepare_command()
13291335
def to_file(ctx: click.Context, **kwargs):
13301336

morpheus/io/serializers.py

Lines changed: 13 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,10 @@
1919
import cudf
2020

2121

22-
def df_to_csv(df: cudf.DataFrame, include_header=False, strip_newline=False) -> typing.List[str]:
22+
def df_to_csv(df: cudf.DataFrame,
23+
include_header=False,
24+
strip_newline=False,
25+
include_index_col=True) -> typing.List[str]:
2326
"""
2427
Serializes a DataFrame into CSV and returns the serialized output seperated by lines.
2528
@@ -31,13 +34,15 @@ def df_to_csv(df: cudf.DataFrame, include_header=False, strip_newline=False) ->
3134
Whether or not to include the header, by default False.
3235
strip_newline : bool, optional
3336
Whether or not to strip the newline characters from each string, by default False.
37+
include_index_col: bool, optional
38+
Write out the index as a column, by default True.
3439
3540
Returns
3641
-------
3742
typing.List[str]
3843
List of strings for each line
3944
"""
40-
results = df.to_csv(header=include_header)
45+
results = df.to_csv(header=include_header, index=include_index_col)
4146
if strip_newline:
4247
results = results.split("\n")
4348
else:
@@ -46,7 +51,7 @@ def df_to_csv(df: cudf.DataFrame, include_header=False, strip_newline=False) ->
4651
return results
4752

4853

49-
def df_to_json(df: cudf.DataFrame, strip_newlines=False) -> typing.List[str]:
54+
def df_to_json(df: cudf.DataFrame, strip_newlines=False, include_index_col=True) -> typing.List[str]:
5055
"""
5156
Serializes a DataFrame into JSON and returns the serialized output seperated by lines.
5257
@@ -56,7 +61,10 @@ def df_to_json(df: cudf.DataFrame, strip_newlines=False) -> typing.List[str]:
5661
Input DataFrame to serialize.
5762
strip_newline : bool, optional
5863
Whether or not to strip the newline characters from each string, by default False.
59-
64+
include_index_col: bool, optional
65+
Write out the index as a column, by default True.
66+
Note: This value is currently being ignored due to a known issue in Pandas:
67+
https://github.com/pandas-dev/pandas/issues/37600
6068
Returns
6169
-------
6270
typing.List[str]
@@ -65,7 +73,7 @@ def df_to_json(df: cudf.DataFrame, strip_newlines=False) -> typing.List[str]:
6573
str_buf = StringIO()
6674

6775
# Convert to list of json string objects
68-
df.to_json(str_buf, orient="records", lines=True)
76+
df.to_json(str_buf, orient="records", lines=True, index=include_index_col)
6977

7078
# Start from beginning
7179
str_buf.seek(0)
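
To make the effect of `include_index_col` on CSV output concrete, here is a minimal stand-in that uses only the standard library `csv` module rather than cudf. The function `rows_to_csv` and its arguments are hypothetical and exist only for this illustration; the real `df_to_csv` delegates to `DataFrame.to_csv(header=..., index=...)`.

```python
import csv
import io
import typing


def rows_to_csv(index: typing.Sequence,
                columns: typing.Sequence[str],
                rows: typing.Sequence[typing.Sequence],
                include_header: bool = False,
                include_index_col: bool = True) -> typing.List[str]:
    """Serialize rows to CSV lines, optionally prefixing each row with its index."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")

    if include_header:
        header = list(columns)
        if include_index_col:
            header.insert(0, "")  # the index column has an empty name
        writer.writerow(header)

    for idx, row in zip(index, rows):
        record = list(row)
        if include_index_col:
            record.insert(0, idx)
        writer.writerow(record)

    return buf.getvalue().splitlines()


lines_with_index = rows_to_csv([0, 1], ["a", "b"], [[1, 2], [3, 4]], include_header=True)
lines_without_index = rows_to_csv([0, 1], ["a", "b"], [[1, 2], [3, 4]],
                                  include_header=True, include_index_col=False)
```

Here `lines_with_index` is `[",a,b", "0,1,2", "1,3,4"]` and `lines_without_index` is `["a,b", "1,2", "3,4"]`, which is the difference the new flag controls for CSV output.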

morpheus/stages/general/buffer_stage.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@
2020
from morpheus.config import Config
2121
from morpheus.pipeline.single_port_stage import SinglePortStage
2222
from morpheus.pipeline.stream_pair import StreamPair
23-
from morpheus.utils.logging import deprecated_stage_warning
23+
from morpheus.utils.logger import deprecated_stage_warning
2424

2525
logger = logging.getLogger(__name__)
2626
