Example: #Apache #Avro using #Java

Apache Avro is a data serialization system.

This time I would like to share a short exercise that illustrates how to use Apache Avro from Java.

To begin, we create a new project with the help of Gradle.

If you have not yet read my article #Gradle Integration for #Eclipse (4.4), this is a good moment to do so.

Preparing build.gradle

Once we have our new Gradle-based project, we need to configure build.gradle to add the dependencies:

'org.apache.avro:avro:1.7.7'
'org.apache.avro:avro-mapred:1.7.7'
'org.apache.hadoop:hadoop-core:1.1.0'

as well as the Avro plugin.

On the first run, the plugin will be downloaded.

It is important to place the plugin block at the beginning of the "build.gradle" file; otherwise the build fails with an error.

Creating the schema

We create the file "user.avsc" and define the schema.

Running Gradle

As mentioned in the article #Gradle Integration for #Eclipse (4.4), the Gradle Integration for Eclipse (4.4) plugin may fail when trying to run

Run As > Gradle Build

If that happens, we can uninstall the Gradle plugins installed in our Eclipse IDE.

In our case:

  • Minimalist Gradle Editor
  • Gradle Integration for Eclipse (4.4)

And then install…

  • Gradle IDE Pack

Alternatively, we can open a command line using Path Tools.

If we define the default tasks in our build.gradle

defaultTasks 'clean', 'build'

we can run the command

gradle

Otherwise, when we try to run the "gradle" command, we will be told to state explicitly which task we want to run.

That is, instead of typing only

gradle

we have to type:

gradle build

If running our "build.gradle" script shows the warning

warning: [options] bootstrap class path not set in conjunction with -source 1.7

we can remove the line

sourceCompatibility = 1.7

from our "build.gradle" script.

FIX gradle.plugin.avro

The gradle.plugin.avro plugin has a bug that causes the classes in the destination directory to be deleted, so its output must not be pointed directly at src/main/java, since that would wipe out the hand-written (non-generated) code.

To work around this, the execution was changed in the build.gradle script.

Instead of:

task generateAvro(type: com.commercehub.gradle.plugin.avro.GenerateAvroJavaTask, dependsOn: 'makePretty') {
    source("src/main/resources/avro")
    outputDir = file("dest/java")
}

compileJava.source(generateAvro.outputs)

the following was used. The classes are generated into a temporary dest/java directory, copied into src/main/java, and the temporary directory is then deleted:

task generateAvro(type: com.commercehub.gradle.plugin.avro.GenerateAvroJavaTask, dependsOn: 'makePretty') {
    source("src/main/resources/avro")
    outputDir = file("dest/java")
}

task fixPluginAvro(type: Delete, dependsOn: 'copyTask') {
    delete 'dest'
}

task copyTask(type: Copy, dependsOn: 'generateAvro') {
    from 'dest/java'
    into 'src/main/java'
}

task makePretty(type: Delete) {
    delete 'dest'
    delete 'src/main/java/avro'
}

//compileJava.source(generateAvro.outputs)

build.dependsOn fixPluginAvro

Fixing the generated classes

The generated classes show a compilation error, which can be fixed easily by simply commenting out the annotation

@Override

Testing Avro

We implement a class to observe Avro's behavior.

Code:

package avro.example;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

import avro.gen.User;

/**
 * @author alexander
 */
public class Avro {

    /**
     * Constructor
     */
    public Avro() {
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        Avro avro = new Avro();
        List<User> listUsers = avro.generateUsers();
        avro.serializing(listUsers);
        avro.deserializing();
    }

    /**
     * Create some Users and set their fields.
     * Avro objects can be created either by invoking a constructor directly or by using a builder.
     * Builders validate the data as it is set, whereas objects constructed directly will not cause an error until the object is serialized.
     * However, using constructors directly generally offers better performance, as builders create a copy of the data structure before it is written.
     * @return List of Users
     */
    private List<User> generateUsers() {

        User user1 = new User();
        user1.setName("Alyssa");
        user1.setFavoriteNumber(256);
        // Leave favorite color null

        // Alternate constructor
        User user2 = new User("Ben", 7, "red");

        // Construct via builder
        User user3 = User.newBuilder()
                .setName("Charlie")
                .setFavoriteColor("blue")
                .setFavoriteNumber(null)
                .build();

        List<User> listUsers = new ArrayList<User>();
        listUsers.add(user1);
        listUsers.add(user2);
        listUsers.add(user3);

        return listUsers;
    }

    /**
     * Serialize our Users to disk.
     */
    private void serializing(List<User> listUsers) {
        // We create a DatumWriter, which converts Java objects into an in-memory serialized format.
        // The SpecificDatumWriter class is used with generated classes and extracts the schema from the specified generated type.
        DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
        // We create a DataFileWriter, which writes the serialized records, as well as the schema, to the file specified in the dataFileWriter.create call.
        // try-with-resources closes the data file when we are done writing, even if an append fails.
        try (DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter)) {
            dataFileWriter.create(listUsers.get(0).getSchema(), new File("users.avro"));
            for (User user : listUsers) {
                // We write our users to the file via calls to the dataFileWriter.append method.
                dataFileWriter.append(user);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Deserialize Users from disk
     */
    private void deserializing() {
        // We create a SpecificDatumReader, analogous to the SpecificDatumWriter we used in serialization, which converts in-memory serialized items into instances of our generated class, in this case User.
        DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
        // We pass the DatumReader and the previously created File to a DataFileReader, analogous to the DataFileWriter, which reads the data file on disk.
        // try-with-resources closes the reader when we are done.
        try (DataFileReader<User> dataFileReader = new DataFileReader<User>(new File("users.avro"), userDatumReader)) {
            User user = null;
            // Next we use the DataFileReader to iterate through the serialized Users and print the deserialized object to stdout.
            while (dataFileReader.hasNext()) {
                // Reuse user object by passing it to next(). This saves us from
                // allocating and garbage collecting many objects for files with
                // many items.
                user = dataFileReader.next(user);
                System.out.println(user);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Results

Running the class writes the serialized users to the file users.avro, and the deserialization step then prints each User to standard output.

Improving the code

We define the avro.properties file.

Code:

#property_name=property_value
pathAvro=C:\\ws\\avro\\avro\\build\\avro
nameFileAvro=users.avro
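The backslashes are doubled because java.util.Properties treats a backslash as an escape character, so each `\\` in the file is loaded as a single `\`. A minimal sketch that loads the same two entries from an in-memory string rather than from disk; the class name PropertiesEscapeSketch and its members are made up for illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class PropertiesEscapeSketch {

    // Same entries as avro.properties, inlined so the sketch needs no file on disk.
    // In Java source every backslash is escaped once more, so "\\\\" is two characters of text.
    static final String CONTENT =
            "pathAvro=C:\\\\ws\\\\avro\\\\avro\\\\build\\\\avro\n"
            + "nameFileAvro=users.avro\n";

    // Parses properties-file text and returns the value stored under the given key.
    static String valueOf(String text, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // A StringReader does not fail in practice; rethrow unchecked just in case.
            throw new IllegalStateException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // The loaded path contains single backslashes.
        System.out.println(valueOf(CONTENT, "pathAvro"));
        System.out.println(valueOf(CONTENT, "nameFileAvro"));
    }
}
```

Forward slashes would also avoid the escaping, since java.io.File accepts them on Windows.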

We implement the UtilProperties utility.

Code:

package avro.example;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

/**
 * @author alexander
 */
public class UtilProperties {

    /**
     * Path of the properties file
     */
    private static final String AVRO_PROPERTIES = "C:\\ws\\avro\\avro\\src\\main\\resources\\avro.properties";

    public UtilProperties() {
    }

    public static String getProperty(String nameProperty) {

        // Create a Properties object
        Properties propiedades = new Properties();

        // Load the file from the given path; try-with-resources closes the stream
        try (FileInputStream in = new FileInputStream(AVRO_PROPERTIES)) {
            propiedades.load(in);
        } catch (IOException e) {
            // Also covers FileNotFoundException, which is a subclass of IOException
            e.printStackTrace();
        }

        // Read the parameter defined in the file
        return propiedades.getProperty(nameProperty);
    }
}
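Note that getProperty re-reads the file on every call. One possible variation, shown only as a sketch, loads the entries once and keeps them in memory. It reads from any InputStream, so the source could be a FileInputStream or a classpath resource. The class CachedProperties and its methods are hypothetical and not part of the project:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class CachedProperties {

    private final Properties props = new Properties();

    // Reads all entries once; try-with-resources closes the supplied stream.
    CachedProperties(InputStream source) {
        try (InputStream in = source) {
            props.load(in);
        } catch (IOException e) {
            throw new IllegalStateException("Could not load properties", e);
        }
    }

    // Convenience factory for demos and tests: parses properties text held in memory.
    static CachedProperties fromText(String text) {
        // Properties.load(InputStream) expects ISO-8859-1 encoded bytes.
        return new CachedProperties(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.ISO_8859_1)));
    }

    // Returns the value, failing loudly instead of handing back null.
    String get(String name) {
        String value = props.getProperty(name);
        if (value == null) {
            throw new IllegalArgumentException("Missing property: " + name);
        }
        return value;
    }

    public static void main(String[] args) {
        CachedProperties config = fromText("nameFileAvro=users.avro\n");
        System.out.println(config.get("nameFileAvro"));
    }
}
```

Failing on a missing key is a design choice of this sketch; the original utility returns null in that case.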

And we use it in the Avro class to make its execution configurable.

package avro.example;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

import avro.gen.User;

/**
 * @author alexander
 */
public class Avro {

    private static final String NAME_FILE_AVRO = "nameFileAvro";
    private static final String PATH_AVRO = "pathAvro";

    /**
     * Constructor
     */
    public Avro() {
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        Avro avro = new Avro();
        List<User> listUsers = avro.generateUsers();
        avro.serializing(listUsers);
        avro.deserializing();
    }

    /**
     * Create some Users and set their fields.
     * Avro objects can be created either by invoking a constructor directly or by using a builder.
     * Builders validate the data as it is set, whereas objects constructed directly will not cause an error until the object is serialized.
     * However, using constructors directly generally offers better performance, as builders create a copy of the data structure before it is written.
     * @return List of Users
     */
    private List<User> generateUsers() {

        User user1 = new User();
        user1.setName("Alyssa");
        user1.setFavoriteNumber(256);
        // Leave favorite color null

        // Alternate constructor
        User user2 = new User("Ben", 7, "red");

        // Construct via builder
        User user3 = User.newBuilder()
                .setName("Charlie")
                .setFavoriteColor("blue")
                .setFavoriteNumber(null)
                .build();

        List<User> listUsers = new ArrayList<User>();
        listUsers.add(user1);
        listUsers.add(user2);
        listUsers.add(user3);

        return listUsers;
    }

    /**
     * Serialize our Users to disk.
     */
    private void serializing(List<User> listUsers) {
        // We create a DatumWriter, which converts Java objects into an in-memory serialized format.
        // The SpecificDatumWriter class is used with generated classes and extracts the schema from the specified generated type.
        DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
        // We create a DataFileWriter, which writes the serialized records, as well as the schema, to the file specified in the dataFileWriter.create call.
        // try-with-resources closes the data file when we are done writing, even if an append fails.
        try (DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter)) {
            File file = createFile();
            dataFileWriter.create(listUsers.get(0).getSchema(), file);
            for (User user : listUsers) {
                // We write our users to the file via calls to the dataFileWriter.append method.
                dataFileWriter.append(user);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Creates the File
     * @return File
     */
    private File createFile() {

        File folder = new File(UtilProperties.getProperty(PATH_AVRO));
        if (!folder.exists()) {
            folder.mkdirs(); // creates the folder, along with any missing parent directories
        }
        String absolutePathFile = folder.getPath().concat("\\").concat(UtilProperties.getProperty(NAME_FILE_AVRO));
        System.out.println(absolutePathFile);

        File file = new File(absolutePathFile);

        return file;
    }

    /**
     * Deserialize Users from disk
     */
    private void deserializing() {
        // We create a SpecificDatumReader, analogous to the SpecificDatumWriter we used in serialization, which converts in-memory serialized items into instances of our generated class, in this case User.
        DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
        // We pass the DatumReader and the previously created File to a DataFileReader, analogous to the DataFileWriter, which reads the data file on disk.
        // DataFileReader is Closeable, so try-with-resources closes it when we are done.
        try (DataFileReader<User> dataFileReader = new DataFileReader<User>(createFile(), userDatumReader)) {
            User user = null;
            // Next we use the DataFileReader to iterate through the serialized Users and print the deserialized object to stdout.
            while (dataFileReader.hasNext()) {
                // Reuse user object by passing it to next(). This saves us from
                // allocating and garbage collecting many objects for files with
                // many items.
                user = dataFileReader.next(user);
                System.out.println(user);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
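A note on how createFile joins the folder and the file name: the hard-coded "\\" separator only works on Windows. The two-argument java.io.File constructor inserts the platform separator itself. A small sketch of that alternative, with a made-up class name and example paths:

```java
import java.io.File;

final class PathJoinSketch {

    // Joins a folder and a file name using the platform's own separator.
    static File resolve(String folder, String fileName) {
        return new File(new File(folder), fileName);
    }

    public static void main(String[] args) {
        // Example values only; in the article they come from avro.properties.
        File file = resolve("build" + File.separator + "avro", "users.avro");
        System.out.println(file.getPath());
    }
}
```

On Windows this yields the same path as the concat version, and on other systems it uses a forward slash.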

First measurements

We fatten the User list to 3 million records with a loop that generates users.

Code:

package avro.example;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

import avro.gen.User;

/**
 * @author alexander
 */
public class Avro {

    private static final String NAME_FILE_AVRO = "nameFileAvro";
    private static final String PATH_AVRO = "pathAvro";

    public Avro() {
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        Avro avro = new Avro();
        List<User> listUsers = avro.generateUsers();

        // Fatten the list to 3 million records
        int i = 1000000;
        do {
            listUsers.addAll(avro.generateUsers());
            --i;
        } while (i > 0);

        avro.serializing(listUsers);
        avro.deserializing();
    }

    /**
     * Create some Users and set their fields.
     * Avro objects can be created either by invoking a constructor directly or by using a builder.
     * Builders validate the data as it is set, whereas objects constructed directly will not cause an error until the object is serialized.
     * However, using constructors directly generally offers better performance, as builders create a copy of the data structure before it is written.
     * @return List of Users
     */
    private List<User> generateUsers() {

        User user1 = new User();
        user1.setName("Alyssa");
        user1.setFavoriteNumber(256);
        // Leave favorite color null

        // Alternate constructor
        User user2 = new User("Ben", 7, "red");

        // Construct via builder
        User user3 = User.newBuilder()
                .setName("Charlie")
                .setFavoriteColor("blue")
                .setFavoriteNumber(null)
                .build();

        List<User> listUsers = new ArrayList<User>();
        listUsers.add(user1);
        listUsers.add(user2);
        listUsers.add(user3);

        return listUsers;
    }

    /**
     * Serialize our Users to disk.
     */
    private void serializing(List<User> listUsers) {
        long tiempoInicio = System.currentTimeMillis();
        // We create a DatumWriter, which converts Java objects into an in-memory serialized format.
        // The SpecificDatumWriter class is used with generated classes and extracts the schema from the specified generated type.
        DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
        // We create a DataFileWriter, which writes the serialized records, as well as the schema, to the file specified in the dataFileWriter.create call.
        // try-with-resources closes the data file when we are done writing, even if an append fails.
        try (DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter)) {
            File file = createFile();
            dataFileWriter.create(listUsers.get(0).getSchema(), file);
            for (User user : listUsers) {
                // We write our users to the file via calls to the dataFileWriter.append method.
                dataFileWriter.append(user);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        terminaProceso("serializing", tiempoInicio);
    }

    /**
     * Creates the File
     * @return File
     */
    private File createFile() {
        File folder = new File(UtilProperties.getProperty(PATH_AVRO));
        if (!folder.exists()) {
            folder.mkdirs(); // creates the folder, along with any missing parent directories
        }
        String absolutePathFile = folder.getPath().concat("\\").concat(UtilProperties.getProperty(NAME_FILE_AVRO));
        System.out.println(absolutePathFile);

        File file = new File(absolutePathFile);
        return file;
    }

    /**
     * Deserialize Users from disk
     */
    private void deserializing() {
        long tiempoInicio = System.currentTimeMillis();
        // We create a SpecificDatumReader, analogous to the SpecificDatumWriter we used in serialization, which converts in-memory serialized items into instances of our generated class, in this case User.
        DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
        // We pass the DatumReader and the previously created File to a DataFileReader, analogous to the DataFileWriter, which reads the data file on disk.
        // DataFileReader is Closeable, so try-with-resources closes it when we are done.
        try (DataFileReader<User> dataFileReader = new DataFileReader<User>(createFile(), userDatumReader)) {
            User user = null;
            // Next we use the DataFileReader to iterate through the serialized Users and print the deserialized object to stdout.
            while (dataFileReader.hasNext()) {
                // Reuse user object by passing it to next(). This saves us from
                // allocating and garbage collecting many objects for files with
                // many items.
                user = dataFileReader.next(user);
                System.out.println(user);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        terminaProceso("deserializing", tiempoInicio);
    }

    private void terminaProceso(String nameProcess, long tiempoInicio) {
        long totalTiempo = System.currentTimeMillis() - tiempoInicio;
        System.out.println("Elapsed time for " + nameProcess + ": " + totalTiempo + " ms");
    }
}
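One thing to keep in mind when reading these timings: deserializing() prints every record inside the timed region, so its elapsed time includes the console output for roughly 3 million lines and not only Avro's work. The timing pattern itself can be pulled into a small helper. This is a sketch, not the project's code: the name StopwatchSketch is hypothetical, and it uses System.nanoTime, which is meant for measuring elapsed time.

```java
import java.util.concurrent.TimeUnit;

final class StopwatchSketch {

    // Runs the task and returns the elapsed time in milliseconds.
    static long elapsedMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        final long[] sum = {0};
        long millis = elapsedMillis(new Runnable() {
            @Override
            public void run() {
                // Stand-in workload; in the article this would be serializing or deserializing.
                for (int i = 0; i < 3000003; i++) {
                    sum[0] += i;
                }
            }
        });
        System.out.println("Elapsed: " + millis + " ms (sum=" + sum[0] + ")");
    }
}
```

To time deserialization alone, the loop could count records instead of printing them.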

When run:

The generated file reaches 37,012,502 bytes

Or, equivalently: 35.3 MB

And the deserialization time is 191,417 ms

Or, equivalently: 3.19 minutes
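The conversions above can be checked with a few lines of arithmetic. The sketch below also derives two figures the article does not report, the average bytes per record and the records per second, from the 3,000,003 records the loop produces (3 initial users plus 1,000,000 iterations of 3 users each). The class name MeasurementMath is made up for illustration:

```java
import java.util.Locale;

final class MeasurementMath {

    static final long FILE_BYTES = 37012502L;   // file size reported in the article
    static final long ELAPSED_MS = 191417L;     // deserialization time reported in the article
    // 3 initial users plus 1,000,000 loop iterations adding 3 users each.
    static final long RECORDS = 3L + 1000000L * 3L;

    // Size in megabytes, using 1 MB = 1024 * 1024 bytes.
    static double megabytes() {
        return FILE_BYTES / (1024.0 * 1024.0);
    }

    // Elapsed time in minutes.
    static double minutes() {
        return ELAPSED_MS / 60000.0;
    }

    // Average serialized size of one record, file header included.
    static double bytesPerRecord() {
        return (double) FILE_BYTES / RECORDS;
    }

    // Deserialization throughput, console output included.
    static double recordsPerSecond() {
        return RECORDS / (ELAPSED_MS / 1000.0);
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT, "%.1f MB", megabytes()));
        System.out.println(String.format(Locale.ROOT, "%.2f minutes", minutes()));
        System.out.println(String.format(Locale.ROOT, "%.1f bytes per record", bytesPerRecord()));
        System.out.println(String.format(Locale.ROOT, "%.0f records per second", recordsPerSecond()));
    }
}
```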

To wrap up, the project is available on GitHub for you to clone:
