Issues using ETL to import CSV file with edges


#1

Hi,

I’ve been struggling the last few days trying to import my data into OrientDB (3.0.12 community edition). I have two files, one containing the data for the vertices [Id; label; description] (which imports fine), and another one with the edges [Source; Target; Id]. I have validated that Gephi can import these files just fine and that the connections are correct.

I’ve been trying to write a JSON file that imports the edges.csv and creates the edges in the database. The current file is as follows:

{

“config”: {
“log”: “debug”
},
“source”: { “file”: { “path”: “/OrientDB/csv/GRC-edges.csv” } },
“extractor”: {
“csv”:
{
“columns”: [
“Source:integer”,
“Target:integer”,
“Id:integer”,
],
“separator”: “;”
}
},
“transformers”: [
{“merge”: {
“joinFieldName”: “Source”,
“lookup”: “Text.Id”
}
},
{ “vertex” : {
“class” : “Text”,
“skipDuplicates” : true
}
},
{ “edge” :
{
“class”: “Link”,
“joinFieldName”: “Target”,
“lookup”: “Text.Id”,
“direction”: “out”,
“unresolvedLinkAction”: “SKIP”
}
},
{
“field”: {
“fieldName”: “@class”,
“value”: “Link”
}
}
],
“loader”: {
“orientdb”: {
“dbURL”: “plocal:…/databases/GRC”,
“dbType”: “graph”,
“classes”: [
{“name”: “Text”, “extends”: “V”},
{“name”: “Link”, “extends”: “E”}
],
“indexes”: [
]
}
}
}

When executing this command I get:
~/Databases/OrientDB/orientdb-3.0.12/bin $ ./oetl.sh …/…/csv/importGRC-edgesExt.json

OrientDB etl v.3.0.12 - Veloce (build b89d566f96b4632bc56422887d7f249e5c2ce23f, branch 3.0.x) https://www.orientdb.com

Warning: Nashorn engine is planned to be removed from a future JDK release

2018-12-28 00:40:47:852 INFO Default limit of open files (512) will be used. [ONative]

2018-12-28 00:40:47:928 INFO [csv] column types: {Source=INTEGER, Target=INTEGER, Id=INTEGER} [OETLCSVExtractor]

2018-12-28 00:40:48:045 INFO 17179869184 B/16384 MB/16 GB of physical memory were detected on machine [ONative]

2018-12-28 00:40:48:045 INFO Detected memory limit for current process is 17179869184 B/16384 MB/16 GB [ONative]

2018-12-28 00:40:48:047 INFO JVM can use maximum 2048MB of heap memory [OMemoryAndLocalPaginatedEnginesInitializer]

2018-12-28 00:40:48:048 INFO Because OrientDB is running outside a container 12% of memory will be left unallocated according to the setting ‘memory.leftToOS’ not taking into account heap memory [OMemoryAndLocalPaginatedEnginesInitializer]

2018-12-28 00:40:48:051 INFO OrientDB auto-config DISKCACHE=12.369MB (heap=2.048MB os=16.384MB) [orientechnologies]

2018-12-28 00:40:48:052 INFO System is started under an effective user : orient [OEngineLocalPaginated]

2018-12-28 00:40:48:153 INFO WAL maximum segment size is set to 7.487 MB [OrientDBEmbedded]

2018-12-28 00:40:48:184 INFO BEGIN ETL PROCESSOR [OETLProcessor]

2018-12-28 00:40:48:185 INFO [file] Reading from file /Users/lodewijk/Databases/OrientDB/csv/GRC-edges.csv with encoding UTF-8 [OETLFileSource]

2018-12-28 00:40:48:185 INFO Started execution with 1 worker threads [OETLProcessor]

2018-12-28 00:40:48:294 INFO Page size for WAL located in /OrientDB/orientdb-3.0.12/bin/…/databases/GRC is set to 4096 bytes. [OCASDiskWriteAheadLog]

2018-12-28 00:40:48:504 INFO Storage ‘plocal:/OrientDB/orientdb-3.0.12/bin/…/databases//GRC’ is opened under OrientDB distribution : 3.0.12 - Veloce (build b89d566f96b4632bc56422887d7f249e5c2ce23f, branch 3.0.x) [OLocalPaginatedStorage]

2018-12-28 00:40:49:189 INFO + extracted 1 rows (0 rows/sec) - 1 rows -> loaded 0 vertices (0 vertices/sec) Total time: 1005ms [0 warnings, 0 errors] [OETLProcessor]

2018-12-28 00:40:49:195 SEVER {db=GRC} Error in Pipeline execution: java.lang.ClassCastException: class com.orientechnologies.orient.core.id.ORecordId cannot be cast to class com.orientechnologies.orient.core.record.impl.ODocument (com.orientechnologies.orient.core.id.ORecordId and com.orientechnologies.orient.core.record.impl.ODocument are in unnamed module of loader ‘app’) [OETLPipeline]

2018-12-28 00:40:49:199 SEVER ETL process has problem: [OETLProcessor]

2018-12-28 00:40:49:200 INFO END ETL PROCESSOR [OETLProcessor]

2018-12-28 00:40:49:201 INFO + extracted 159 rows (14.363 rows/sec) - 159 rows -> loaded 0 vertices (0 vertices/sec) Total time: 1016ms [0 warnings, 1 errors] [OETLProcessor]^C

2018-12-28 00:43:51:659 WARNI Received signal: SIGINT [OSignalHandler]

2018-12-28 00:43:51:662 INFO Orient Engine is shutting down… [Orient]

2018-12-28 00:43:51:663 INFO - shutdown storage: GRC… [OrientDBEmbedded]

2018-12-28 00:43:51:801 INFO Clearing byte buffer pool [Orient]

2018-12-28 00:43:51:801 INFO OrientDB Engine shutdown complete [Orient]

I’ve been parsing all the documentation and just can’t find what I am doing wrong. Can anyone help me find my error?


#2

Hi,

could you post the content of the csv file?

Thanks
Regards,
Michela


#3

Thanks for helping. In the meantime I have simplified my CSV files just to get a handle on how to import Vertices and Edges from two different files. This is proving to be difficult as well.

The simplified Vertex CSV and the simplified Edge CSV are below. Also added are the json files I have created to import these files.

When I import the Vertices with the JSON file, everything seems in order - all vertices are loaded into the database correctly (some have text with carriage returns in them).

When I import the Edges with the JSON file, I get:

2018-12-28 17:54:28:759 WARNI {db=GRC} WARNING: index Text.Id not found. Lookups could be really slow [OETLMergeTransformer]
2018-12-28 17:54:28:783 WARNI {db=GRC} WARNING: index Text.Id not found. Lookups could be really slow [OETLEdgeTransformer]
2018-12-28 17:54:28:786 WARNI {db=GRC} Transformer [edge] returned null, skip rest of pipeline execution [OETLPipeline]
2018-12-28 17:54:28:787 WARNI {db=GRC} Transformer [field] returned null, skip rest of pipeline execution [OETLPipeline]
2018-12-28 17:54:28:790 WARNI {db=GRC} Transformer [edge] returned null, skip rest of pipeline execution [OETLPipeline]
2018-12-28 17:54:28:790 WARNI {db=GRC} Transformer [field] returned null, skip rest of pipeline execution [OETLPipeline]

2018-12-28 17:54:30:636 INFO END ETL PROCESSOR [OETLProcessor]
2018-12-28 17:54:30:636 INFO + extracted 1.334 rows (350 rows/sec) - 1.334 rows -> loaded 0 vertices (0 vertices/sec) Total time: 2637ms [0 warnings, 0 errors] [OETLProcessor]

Nothing is imported - but at least I did not get a conversion error like the previous…

Why is it so hard to find a good example of importing vertices and edges from different files?

Your help is much appreciated!

Files

nodes.CSV

Id;ShortLabel;Reference;Type
1;AcmeCorp External Suppli;AcmeCorp;Title
2;Managing obsolescence ;AcmeCorp;Title
3;Incident Handling ;AcmeCorp;Title
4;Problem Management ;AcmeCorp;Title
5;Configuration Management;AcmeCorp;Title
6;Service Continuity ;AcmeCorp;Title
7;Backup arrangements for ;AcmeCorp;Title
8;Performance and capacity;AcmeCorp;Title
9;Change Management ;AcmeCorp;Title
10;Technology Risk ;AcmeCorp;Title
11;Ensuring ongoing support;AcmeCorp;Title
12;Recording, classifying a;AcmeCorp;Title
13;Identifying, assessing/a;AcmeCorp;Title
14;Isolating the Production;AcmeCorp;Title
15;Exercising source code m;AcmeCorp;Title
16;Providing and validating;AcmeCorp;Title
17;Managing the data centre;AcmeCorp;Title
18;Operating appropriate an;AcmeCorp;Title
19;Ensuring safe, secure an;AcmeCorp;Title
20;Remaining aligned to Acm;AcmeCorp;Title
21;Enforcing rigorous chang;AcmeCorp;Title
22;Maintaining complete and;AcmeCorp;Title
23;The supplier must prompt;AcmeCorp;Description
24;The supplier must operat;AcmeCorp;Description
25;The supplier must operat;AcmeCorp;Description
26;The supplier must mainta;AcmeCorp;Description
27;The supplier must ensure;AcmeCorp;Description
28;The supplier must ensure;AcmeCorp;Description
29;The supplier must unders;AcmeCorp;Description
30;The supplier must ensure;AcmeCorp;Description
31;The supplier must ensure;AcmeCorp;Description
32;The supplier must ensure;AcmeCorp;Description
33;The supplier must define;AcmeCorp;Description
34;The supplier must ensure;AcmeCorp;Description
35;No change without approp;AcmeCorp;Description
36;Segregation of duties be;AcmeCorp;Description
37;Changes planned and mana;AcmeCorp;Description
38;Changes take adequate ac;AcmeCorp;Description
39;All changes to be fully ;AcmeCorp;Description
40;Rights to modify the Pro;AcmeCorp;Description
41;Changes undergo technica;AcmeCorp;Description
42;Testing is performed in ;AcmeCorp;Description
43;Change is accompanied by;AcmeCorp;Description
44;Inadequate records and/o;AcmeCorp;Why
45;Technology incidents not;AcmeCorp;Why
46;Where underlying problem;AcmeCorp;Why
47;Inappropriate register e;AcmeCorp;Why
48;Absence or Inadequate se;AcmeCorp;Why
49;Absence or Poorly contro;AcmeCorp;Why
50;Inadequate definition an;AcmeCorp;Why
51;Inadequate measures to m;AcmeCorp;Why
52;Cyber Security;AcmeCorp;Title
53;Cyber Security Governanc;AcmeCorp;Title
54;The Supplier must have c;AcmeCorp;Description
55;If this control is not i;AcmeCorp;Why
56;Cyber Security Policy an;AcmeCorp;Title
57;The Supplier must mainta;AcmeCorp;Description
58;Documented policies and ;AcmeCorp;Why
59;Asset Management;AcmeCorp;Title
60;The Supplier must mainta;AcmeCorp;Description
61;If this control is not i;AcmeCorp;Why
62;Vulnerability Management;AcmeCorp;Title
63;The Supplier must operat;AcmeCorp;Description
64;The Supplier must establ;AcmeCorp;Description
65;Where the Supplier devel;AcmeCorp;Description
66;The Supplier must ensure;AcmeCorp;Description
67;All security issues and ;AcmeCorp;Description
68;IT Security patches and ;AcmeCorp;Description
69;If this control is not i;AcmeCorp;Why
70;Network Security;AcmeCorp;Title
71;The Supplier must ensure;AcmeCorp;Description
72;The following protection;AcmeCorp;Description
73;Capabilities must be def;AcmeCorp;Description
74;Network sensors must be ;AcmeCorp;Description
75;If this control is not i;AcmeCorp;Why
76;Cyber Security Awareness;AcmeCorp;Title
77;Enhanced cyber security ;AcmeCorp;Description
78;The Supplier must provid;AcmeCorp;Description
79;Training must address st;AcmeCorp;Description
80;If this control is not i;AcmeCorp;Why
81;Security Protection Tech;AcmeCorp;Title
82;Appropriate technologies;AcmeCorp;Description
83;Host systems and network;AcmeCorp;Description
84;If this control is not i;AcmeCorp;Why
85;Endpoint Security;AcmeCorp;Title
86;The Supplier must ensure;AcmeCorp;Description
87;This includes, but is no;AcmeCorp;Description
88;If this control is not i;AcmeCorp;Why
89;Denial of Service Detect;AcmeCorp;Title
90;The Supplier must implem;AcmeCorp;Description
91;The Supplier must ensure;AcmeCorp;Description
92;If this control is not i;AcmeCorp;Why
93;Monitoring/Logging;AcmeCorp;Title
94;The Supplier must ensure;AcmeCorp;Description
95;The Supplier must collec;AcmeCorp;Description
96;All key systems, includi;AcmeCorp;Description
97;Logs must be centralized;AcmeCorp;Description
98;The key events logged mu;AcmeCorp;Description
99;If this control is not i;AcmeCorp;Why

nodes.json

{
“config”: {
“log”: “debug”
},
“source”: { “file”: { “path”: “…/…/csv/nodes.csv” } },
“extractor”:
{“csv”:
{
“columns”: [“Id:integer”,“ShortLabel:string”,“Reference:string”,“Type:string”],
“separator”: “;”
}
},
“transformers”: [
{ “vertex”: { “class”: “Text” } }
],
“loader”: {
“orientdb”: {
“dbURL”: “plocal:…/databases/GRC”,
“dbType”: “graph”,
“classes”: [
{“name”: “Text”, “extends”: “V”}
], “indexes”: [
{“class”:“Text”, “fields”:[“Id:integer”], “type”:“UNIQUE” }
]
}
}
}

edges.csv:

Source;Target;Id
2;11;10000
4;13;10001
3;12;10002
5;22;10003
5;14;10004
5;15;10005
6;16;10006
6;17;10007
7;18;10008
7;19;10009
8;20;10010
9;21;10011
10;2;10012
10;3;10013
10;4;10014
10;5;10015
10;6;10016
10;7;10017
10;8;10018
10;9;10019
11;23;10020
12;24;10021
13;25;10022
22;26;10023
14;27;10024
15;28;10025
16;29;10026
17;30;10027
18;31;10028
19;32;10029
20;33;10030
34;35;10031
34;36;10032
34;37;10033
34;38;10034
34;39;10035
34;40;10036
34;41;10037
34;42;10038
34;43;10039
21;34;10040
1;10;10041
44;23;10042
45;24;10043
46;25;10044
47;26;10045
47;27;10046
47;28;10047
48;29;10048
48;30;10049
49;31;10050
49;32;10051
50;33;10052
51;34;10053
1;52;10054
52;53;10055
53;54;10056
55;54;10057
52;56;10058
56;57;10059
58;57;10060
52;59;10061
59;60;10062
61;60;10063
52;62;10064
62;63;10065
62;64;10066
62;65;10067
62;66;10068
62;67;10069
62;68;10070
69;63;10071
69;64;10072
69;65;10073
69;66;10074
69;67;10075
69;68;10076
52;70;10077
70;71;10078
70;72;10079
70;73;10080
70;74;10081
75;71;10082
75;72;10083
75;73;10084
75;74;10085
52;76;10086
76;77;10087
76;78;10088
76;79;10089
80;77;10090
80;78;10091
80;79;10092
52;81;10093
81;82;10094
81;83;10095
84;82;10096
84;83;10097
52;85;10098
85;86;10099
85;87;10100
88;86;10101
88;87;10102
52;89;10103
89;90;10104
89;91;10105
92;90;10106
92;91;10107
52;93;10108
93;94;10109
93;95;10110
93;96;10111
93;97;10112
93;98;10113
99;94;10114
99;95;10115
99;96;10116
99;97;10117
99;98;10118

edges.json

{
“config”: {
“log”: “debug”
},
“source”: {
“file”: {
“path”: “…/…/csv/edges.csv”
}
},
“extractor”: {
“csv”: {
“columns”: [
“Source:integer”,
“Target:integer”,
“Id:integer”
],
“separator”: “;”
}
},
“transformers”: [
{
“merge”: {
“joinFieldName”: “Source”,
“lookup”: “Text.Id”
}
},
{
“vertex”: {
“class”: “Text”,
“skipDuplicates”: true
}
},
{
“edge”: {
“class”: “Link”,
“joinFieldName”: “Target”,
“lookup”: “Text.Id”,
“direction”: “out”,
“unresolvedLinkAction”: “SKIP”
}
},
{
“field”: {
“fieldName”: “@class”,
“value”: “Link”
}
}
],
“loader”: {
“orientdb”: {
“dbURL”: “plocal:…/databases/GRC”,
“dbType”: “graph”,
“classes”: [
{
“name”: “Text”,
“extends”: “V”
},
{
“name”: “Link”,
“extends”: “E”
}
],
“indexes”: []
}
}
}


#4

Working further on the issue, I have changed my approach and rewritten the edge.json to include commands. Now I can get the Edges created correctly, with the right labels, but the ETL importer consistently spits out the following error:

2018-12-30 10:15:01:579 SEVER {db=GRC} input type not supported:: class java.util.ArrayList [OETLOrientDBLoader]
2018-12-30 10:15:01:592 SEVER {db=GRC} input type not supported:: class java.util.ArrayList [OETLOrientDBLoader]
2018-12-30 10:15:01:601 SEVER {db=GRC} input type not supported:: class java.util.ArrayList [OETLOrientDBLoader]

My edge.json file is as follows:

{
“config”: {
“log”: “debug”
},
“source”: { “file”: { “path”: “…/…/csv/GRC-edges.csv” } },
“extractor”: {
“csv”:
{
“columns”: [
“Source:integer”,
“Target:integer”,
“Id:integer”
],
“separator”: “;”
}
},
“transformers”: [
{
“command”: {
“command”: "
CREATE EDGE Link FROM
(select from Text where Id = {input.Source}) TO (select from Text where Id = {input.Target})
SET
Id = {input.Id}, Source = {input.Source},
Target = ${input.Target}"
}
},
{
“field”: {
“fieldName”: “@class”,
“value”: “Link”
}
}
],
“loader”: {
“orientdb”: {
“dbURL”: “plocal:…/databases/GRC”,
“dbType”: “graph”,
“classes”: [
{“name”: “Link”, “extends”: “E”}
], “indexes”: [
{“class”:“Link”, “fields”:[“Id:integer”], “type”:“NOTUNIQUE” }
]
}
}
}

Any hints on why I get the Arraylist error during import and how to fix it?