Need: We are looking to form a team of two skilled developers who are willing to work on a big data project covering the three tasks described below.
Objective: The objective is to scale the platform for future needs and to address current limitations in data retrieval by building a mini search engine with Elasticsearch.
Minimizing cost is an important factor, which motivated the use of open-source tools such as:
- Apache NiFi
- Apache Spark
- Elassandra (Cassandra + Elasticsearch)
User stories:
User story 1: As a user, I want to be able to import a large CSV file into Cassandra.
User story 2: As a user, I want to be able to search on a specific field (mail, id, ...) and have the first ten matches displayed.
User story 3: As a user, I want to be able to export a CSV file containing the results of the query I entered in the search engine.
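User story 1 (importing a heavy CSV file) is usually handled by streaming the file in fixed-size batches rather than loading it all into memory. A minimal sketch of the batching side in Python; the Cassandra write itself is only indicated in a comment, and the table and column names are hypothetical:

```python
import csv
from itertools import islice

def read_batches(csv_file, batch_size=500):
    """Stream rows from an open CSV file in fixed-size batches,
    so a heavy file never has to fit in memory at once."""
    reader = csv.DictReader(csv_file)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        yield batch

# Usage sketch (hypothetical 'users' table, cassandra-driver assumed):
# for batch in read_batches(open("data.csv")):
#     for row in batch:
#         session.execute(insert_stmt, (row["id"], row["mail"]))
```

In practice the batch size would be tuned against cluster write throughput, or the load delegated entirely to an Apache NiFi flow as the brief suggests.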
Explanation of the three main project functionalities:
The user will have access to a simple graphical user interface that lets them choose between:
- Import: allows importing large data files
- Export: generates a text file matching a filter (a table name or a property) specified in the search input field
- Search: filters the data and renders at most 10 rows matching the search query to the user
Note: This project generates only one view for the user based on the input in the search field.
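The Search functionality maps naturally onto an Elasticsearch match query capped at 10 hits. A minimal sketch of building such a request body; the field name and index are only illustrations, and the actual request would go through the Elasticsearch client against the Elassandra cluster:

```python
def build_search_query(field, value, max_hits=10):
    """Build an Elasticsearch request body that matches `value`
    against `field` and returns at most `max_hits` rows."""
    return {
        "size": max_hits,                     # render at most 10 rows
        "query": {"match": {field: value}},   # full-text match on one field
    }

# The body would then be sent with the Elasticsearch client, e.g.:
# es.search(index="users", body=build_search_query("mail", "john@example.com"))
```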
Expected features:
1) Simple graphical interface containing the 3 main sub-functionalities: an Import button, an Export button (the export can be a collection, depending on the filter and the query), and a Search field (to filter).
• Level of priority: Medium
• Expected programming skills: HTML / CSS / Flask (preferred but not a must)
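Feature 1 could be a very small Flask application. A sketch with placeholder handlers, assuming Flask is the chosen framework; the route names and inline markup are illustrative, not part of the spec:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def index():
    # In a real UI this would be a template with the Import and Export
    # buttons and the search field; kept inline here for brevity.
    return ("<form action='/search'><input name='q'>"
            "<button>Search</button></form>"
            "<a href='/import'>Import</a> <a href='/export'>Export</a>")

@app.route("/search")
def search():
    q = request.args.get("q", "")
    # Placeholder: the real handler would query Elassandra and
    # render at most 10 matching rows.
    return f"results for: {q}"
```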
2) Set up an Elassandra cluster (2 nodes). Load a large data file into the cluster in order to achieve distributed storage. Finally, export the result of a specific query to a CSV file.
• Level of priority: High
• Expected programming skills: Python, Elassandra NoSQL (Cassandra + Elasticsearch), Apache NiFi
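The export step of feature 2 is the last link in the chain: serialize the rows returned by a query into CSV. A minimal sketch using only the standard library, where the rows and field names are hypothetical placeholders for the actual query result:

```python
import csv
import io

def rows_to_csv(rows, fieldnames):
    """Serialize query-result rows (dicts) into CSV text,
    ready to be written to the export file."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# In the real pipeline `rows` would come from the Elassandra query:
# open("export.csv", "w").write(rows_to_csv(rows, ["id", "mail"]))
```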
3) Server side for data processing: this has two main and separate goals:
First: if one of the key fields already exists, upsert the missing fields from the new record into the already existing record, and vice-versa.
Second: implement the search algorithm that will retrieve the expected row, table, or even field.
• Level of priority: High
• Expected programming skills: Python, Apache Spark, Elassandra (Cassandra + Elasticsearch)
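In production the first goal would run inside a Spark job, but the core merge rule can be sketched as a plain Python function first. One assumption is made loudly here: on a conflict, the value already stored on the existing record wins, since the spec leaves conflict resolution open:

```python
def merge_records(existing, new):
    """Bidirectional field fill: any field missing (or empty) on one
    side is taken from the other. ASSUMPTION: on conflict, the
    existing record's non-empty value is kept (spec leaves this open)."""
    merged = dict(new)             # start from the incoming record
    for field, value in existing.items():
        if value not in (None, ""):
            merged[field] = value  # existing non-empty values win
    return merged
```

Running the same rule at scale would mean joining the new batch against the stored table on the key field in Spark and applying this merge per matched pair.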
Expected result: The project will be deemed successful if the user stories are met and the database fields are updated as described in the section above.
Hi,
I am an experienced Data Engineer with a solid background in Spark.
I have worked on many Big Data projects with Spark, Scala, Python, Cassandra, Snowflake, AWS, ...
Let's have a call for more details about the project.
Regards