ETL Process has problem: [OETLProcessor]


#1

Hello,
I’m trying to import some .csv files with up to 200,000 lines. (It is the “The Simpsons by the Data” dataset from Kaggle: https://www.kaggle.com/wcukierski/the-simpsons-by-the-data) But each time I start the OETL process it gives me this output:

-> sudo /opt/orientdb/bin/oetl.sh /home/petey/Dokumente/characters.json

OrientDB etl v.3.0.8 - Veloce (build ace6fe7e50f069968d2ec61678848606703db1c0, branch 3.0.x) https://www.orientdb.com

2018-12-08 16:47:34:253 INFO  Detected limit of amount of simultaneously open files is 1048576,  limit of open files for disk cache will be set to 523776 [ONative]
2018-12-08 16:47:34:352 INFO  8239050752 B/7857 MB/7 GB of physical memory were detected on machine [ONative]
2018-12-08 16:47:34:353 INFO  Soft memory limit for this process is set to -1 B/-1 MB/-1 GB [ONative]
2018-12-08 16:47:34:354 INFO  Hard memory limit for this process is set to -1 B/-1 MB/-1 GB [ONative]
2018-12-08 16:47:34:355 INFO  Path to 'memory' cgroup is '/' [ONative]
2018-12-08 16:47:34:358 INFO  Mounting path for memory cgroup controller is '/sys/fs/cgroup/memory' [ONative]
2018-12-08 16:47:34:359 INFO  cgroup soft memory limit is 9223372036854771712 B/8796093022207 MB/8589934591 GB [ONative]
2018-12-08 16:47:34:359 INFO  cgroup hard memory limit is 9223372036854771712 B/8796093022207 MB/8589934591 GB [ONative]
2018-12-08 16:47:34:360 INFO  Detected memory limit for current process is 8239050752 B/7857 MB/7 GB [ONative]
2018-12-08 16:47:34:362 INFO  JVM can use maximum 1963MB of heap memory [OMemoryAndLocalPaginatedEnginesInitializer]
2018-12-08 16:47:34:362 INFO  Because OrientDB is running outside a container 12% of memory will be left unallocated according to the setting 'memory.leftToOS' not taking into account heap memory [OMemoryAndLocalPaginatedEnginesInitializer]
2018-12-08 16:47:34:365 INFO  OrientDB auto-config DISKCACHE=4.951MB (heap=1.963MB os=7.857MB) [orientechnologies]
2018-12-08 16:47:34:370 INFO  System is started under an effective user : `root` [OEngineLocalPaginated]
2018-12-08 16:47:34:749 INFO  Storage 'plocal:/opt/orientdb/databases//Simpsons_by_the_data' is opened under OrientDB distribution : 3.0.8 - Veloce (build ace6fe7e50f069968d2ec61678848606703db1c0, branch 3.0.x) [OLocalPaginatedStorage]
2018-12-08 16:47:35:416 INFO  Storage 'plocal:/opt/orientdb/databases//Simpsons_by_the_data' is created under OrientDB distribution : 3.0.8 - Veloce (build ace6fe7e50f069968d2ec61678848606703db1c0, branch 3.0.x) [OLocalPaginatedStorage]
2018-12-08 16:47:37:672 INFO  BEGIN ETL PROCESSOR [OETLProcessor]
2018-12-08 16:47:37:674 INFO  [file] Reading from file /home/petey/Dokumente/the-simpsons-by-the-data/simpsons_characters.csv with encoding UTF-8 [OETLFileSource]
2018-12-08 16:47:37:674 INFO  Started execution with 1 worker threads [OETLProcessor]
2018-12-08 16:47:38:674 INFO  + extracted 0 rows (0 rows/sec) - 0 rows -> loaded 0 vertices (0 vertices/sec) Total time: 1000ms [0 warnings, 0 errors] [OETLProcessor]
2018-12-08 16:47:39:673 INFO  + extracted 509 rows (509 rows/sec) - 509 rows -> loaded 115 vertices (115 vertices/sec) Total time: 2s [0 warnings, 0 errors] [OETLProcessor]
**2018-12-08 16:47:40:127 SEVER ETL process has problem:  [OETLProcessor]**
2018-12-08 16:47:40:128 INFO  END ETL PROCESSOR [OETLProcessor]
2018-12-08 16:47:40:130 INFO  + extracted 916 rows (892 rows/sec) - 916 rows -> loaded 916 vertices (1.756 vertices/sec) Total time: 2456ms [0 warnings, 0 errors] [OETLProcessor]

It always stops after 916 records out of more than 6000.

My ETL config file looks like this:

{
  "source": { "file": { "path": "/home/petey/Dokumente/the-simpsons-by-the-data/simpsons_characters.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "Characters" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/opt/orientdb/databases/Simpsons_by_the_data",
      "wal": false,
      "batchCommit": 500,
      "dbAutoCreate": true,
      "dbAutoDropIfExists": true,
      "dbAutoCreateProperties": true,
      "txUseLog": true,
      "dbType": "graph",
      "classes": [
        { "name": "Characters", "extends": "V" },
        { "name": "Locations", "extends": "V" },
        { "name": "Episodes", "extends": "V" },
        { "name": "Script_lines", "extends": "V" },
        { "name": "from", "extends": "E" },
        { "name": "where", "extends": "E" },
        { "name": "when", "extends": "E" }
      ],
      "indexes": [
        { "class": "Script_lines", "fields": ["id:integer"], "type": "UNIQUE" }
      ]
    }
  }
}

What am I doing wrong?

Thanks!


#2

I think I got it. The problem was that the CSV file was not valid: there were stray quotes in it. I was able to load the episodes, characters, and locations CSVs, but the process still gets grumpy about the biggest one, the script lines.

I deleted all unnecessary lines and kept just the normalized text and the numbers, but it still won’t load. I can’t check whether the file is valid; it is too big for csvlint, at about 125,000 lines.
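Since the file is too large for csvlint, a streaming check with Python's csv module (which reads one record at a time, so 125,000 lines are no problem for memory) can report any record whose field count disagrees with the header. A sketch on a tiny made-up inline sample; the real call would open simpsons_script_lines.csv instead of the StringIO:

```python
import csv
import io

def bad_rows(f, limit=10):
    """Yield (line_number, field_count) for records whose field count
    differs from the header's; stop after `limit` problems."""
    reader = csv.reader(f)
    header = next(reader)
    found = 0
    for row in reader:
        if len(row) != len(header):
            yield reader.line_num, len(row)
            found += 1
            if found >= limit:
                return

# Tiny inline sample standing in for simpsons_script_lines.csv;
# the unquoted comma in the text field splits it into an extra column.
sample = (
    "id,episode_id,character_id,location_id,normalized_text\n"
    "9549,32,2,3,no time for that now\n"
    "9550,32,2,3,wait, what happened\n"
    "9551,32,8,3,ow\n"
)
problems = list(bad_rows(io.StringIO(sample)))
print(problems)  # [(3, 6)]
```

For the full file this would be `list(bad_rows(open(path, newline="", encoding="utf-8")))`, which streams the whole 125,000 lines in a few seconds.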

OrientDB etl v.3.0.8 - Veloce (build ace6fe7e50f069968d2ec61678848606703db1c0, branch 3.0.x) https://www.orientdb.com
2018-12-11 19:53:52:015 INFO  Detected limit of amount of simultaneously open files is 1048576,  limit of open files for disk cache will be set to 523776 [ONative]
2018-12-11 19:53:52:123 INFO  8239050752 B/7857 MB/7 GB of physical memory were detected on machine [ONative]
2018-12-11 19:53:52:125 INFO  Soft memory limit for this process is set to -1 B/-1 MB/-1 GB [ONative]
2018-12-11 19:53:52:125 INFO  Hard memory limit for this process is set to -1 B/-1 MB/-1 GB [ONative]
2018-12-11 19:53:52:126 INFO  Path to 'memory' cgroup is '/' [ONative]
2018-12-11 19:53:52:131 INFO  Mounting path for memory cgroup controller is '/sys/fs/cgroup/memory' [ONative]
2018-12-11 19:53:52:132 INFO  cgroup soft memory limit is 9223372036854771712 B/8796093022207 MB/8589934591 GB [ONative]
2018-12-11 19:53:52:132 INFO  cgroup hard memory limit is 9223372036854771712 B/8796093022207 MB/8589934591 GB [ONative]
2018-12-11 19:53:52:133 INFO  Detected memory limit for current process is 8239050752 B/7857 MB/7 GB [ONative]
2018-12-11 19:53:52:135 INFO  JVM can use maximum 1963MB of heap memory [OMemoryAndLocalPaginatedEnginesInitializer]
2018-12-11 19:53:52:136 INFO  Because OrientDB is running outside a container 12% of memory will be left unallocated according to the setting 'memory.leftToOS' not taking into account heap memory [OMemoryAndLocalPaginatedEnginesInitializer]
2018-12-11 19:53:52:139 INFO  OrientDB auto-config DISKCACHE=4.951MB (heap=1.963MB os=7.857MB) [orientechnologies]
2018-12-11 19:53:52:144 INFO  System is started under an effective user : `root` [OEngineLocalPaginated]
2018-12-11 19:53:52:193 INFO  BEGIN ETL PROCESSOR [OETLProcessor]
2018-12-11 19:53:52:195 INFO  [file] Reading from file /home/petey/Dokumente/the-simpsons-by-the-data/simpsons_script_lines.csv with encoding UTF-8 [OETLFileSource]
2018-12-11 19:53:52:196 INFO  Started execution with 1 worker threads [OETLProcessor]
2018-12-11 19:53:52:201 SEVER ETL process has problem:  [OETLProcessor]
2018-12-11 19:53:52:203 INFO  END ETL PROCESSOR [OETLProcessor]
2018-12-11 19:53:52:204 INFO  + extracted 0 rows (0 rows/sec) - 0 rows -> loaded 0 vertices (0 vertices/sec) Total time: 9ms [0 warnings, 0 errors] [OETLProcessor]

JSON file:

{
  "source": { "file": { "path": "/home/petey/Dokumente/the-simpsons-by-the-data/simpsons_script_lines.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "Script_lines" } },
    { "edge": {
        "class": "from",
        "joinFieldName": "character_id",
        "lookup": "Characters.id",
        "direction": "out"
    } },
    { "edge": {
        "class": "where",
        "joinFieldName": "location_id",
        "lookup": "Locations.id",
        "direction": "out"
    } },
    { "edge": {
        "class": "when",
        "joinFieldName": "episode_id",
        "lookup": "Episodes.id",
        "direction": "out"
    } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/opt/orientdb/databases/Simpsons_by_the_data",
      "wal": false,
      "batchCommit": 500,
      "dbAutoCreate": true,
      "dbAutoCreateProperties": true,
      "txUseLog": true,
      "dbType": "graph",
      "classes": [
        { "name": "Characters", "extends": "V" },
        { "name": "Locations", "extends": "V" },
        { "name": "Episodes", "extends": "V" },
        { "name": "Script_lines", "extends": "V" },
        { "name": "from", "extends": "E" },
        { "name": "where", "extends": "E" },
        { "name": "when", "extends": "E" }
      ],
      "indexes": [
        { "class": "Script_lines", "fields": ["id:integer"], "type": "UNIQUE" }
      ]
    }
  }
}
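One side note, independent of the failure above (this is my assumption, not something the log confirms): the three edge transformers do a `lookup` on `Characters.id`, `Locations.id`, and `Episodes.id` for every row, and the OrientDB ETL documentation recommends indexing lookup fields. The loader's `indexes` section could declare them alongside the existing `Script_lines` index, for example:

```json
"indexes": [
  { "class": "Script_lines", "fields": ["id:integer"], "type": "UNIQUE" },
  { "class": "Characters", "fields": ["id:integer"], "type": "UNIQUE" },
  { "class": "Locations", "fields": ["id:integer"], "type": "UNIQUE" },
  { "class": "Episodes", "fields": ["id:integer"], "type": "UNIQUE" }
]
```

This wouldn't explain the immediate SEVER at 0 rows, but it should make the per-row joins much faster once the file parses.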