Tags » Google Summer Of Code

GSOC update 6 - Sharing data and code

Long post ahead :P

As mentioned in my last post, the unit tests in VMS had several potential areas of improvement, two in particular: first, code was not being reused; second, data was not being reused. 663 more words

Open Source

Loklak fuels Open Event !

A general background building….

The FOSSASIA Open Event Project aims to make it easy for events, conferences and tech summits to create Web and Mobile (currently Android only) micro apps. 1,641 more words

jigyasa reblogged this on Being Curious .... and commented:

As Google Summer of Code 2016 has officially begun, I am all excited to be working with FOSSASIA yet again. This time, I have been assigned the project Loklak where I shall be spending the summer working on implementing indexing (harvester/scrapers) for different services like weibo.com, angel.co, meetup.com, Instagram etc.

Loklak is a server application which is able to collect messages from various sources, including Twitter. This server contains a search index and a peer-to-peer index sharing interface.

If you would like to stay anonymous while searching, want to archive Tweets or messages about specific topics, or are looking for a tool to create statistics about Tweet topics, then Loklak is the best option to consider.


Loklak fuels Open Event !

A general background building....

The FOSSASIA Open Event Project aims to make it easy for events, conferences and tech summits to create Web and Mobile (currently Android only) micro apps. The project comprises a data schema for storing event details, a server and web front-end that event organizers use to view, modify and update this data, a mobile-friendly web-app client that shows the event data to attendees, and an Android app template that is used to generate a specific app for each event.

Eventbrite, in turn, is the world's largest self-service ticketing platform. It allows anyone to create, share and find events such as music festivals, marathons, conferences, hackathons, air guitar contests, political rallies, fundraisers, gaming competitions etc.

Kaboom !

Loklak now has a dedicated Eventbrite scraper API which takes the URL of an event listing on eventbrite.com and outputs the JSON files required by the Open Event Generator, viz. events.json, organizer.json, user.json, microlocations.json, sessions.json, session_types.json, tracks.json, sponsors.json, speakers.json, social_links.json and custom_forms.json (details: Open Event Server : API Documentation).

What do we do differently from the Eventbrite API? No authentication tokens are required. This gels perfectly with the Loklak mission.

To achieve this, I have simply parsed the HTML pages using my favorite JSoup: The Java HTML parser library, because it provides a very convenient API for fetching HTML from a URL, parsing it, and extracting and manipulating the data.

The API call format is:
http://loklak.org/api/eventbritecrawler.json?url=https://www.eventbrite.com/[event-name-and-id]

In return we get all the details on the Eventbrite page as a JSONObject, and they are also stored in separately named files in a zipped folder [userHome + "/Downloads/EventBriteInfo"].

Example:

Event URL: https://www.eventbrite.de/e/global-health-security-focus-africa-tickets-25740798421

API Call: http://loklak.org/api/eventbritecrawler.json?url=https://www.eventbrite.de/e/global-health-security-focus-africa-tickets-25740798421

Output: JSON Object on screen, and the events.json, organizer.json, user.json, microlocations.json, sessions.json, session_types.json, tracks.json, sponsors.json, speakers.json, social_links.json and custom_forms.json files written out in a zipped folder locally.
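The call format above can be sketched as a small helper. This is only an illustration: the class and method names are hypothetical and not part of loklak_server, and percent-encoding the event URL is my own assumption (the calls quoted in this post pass it unencoded), made on the expectation that the server decodes query parameters in the usual way.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EventbriteCrawlerRequest {

    // Endpoint exactly as quoted in the post
    private static final String ENDPOINT = "http://loklak.org/api/eventbritecrawler.json";

    // Hypothetical helper: builds the API call for a given Eventbrite event page
    public static String buildRequestUrl(String eventUrl) {
        try {
            // Encode the inner URL so its ':' and '/' cannot be confused with the outer one
            return ENDPOINT + "?url=" + URLEncoder.encode(eventUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildRequestUrl(
                "https://www.eventbrite.de/e/global-health-security-focus-africa-tickets-25740798421"));
    }
}
```

Encoding matters mostly when the event URL carries its own query string, since an unencoded "&" would otherwise be read as the start of a second parameter of the API call.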
For reference, the code is as follows:
/**
 *  Eventbrite.com Crawler v2.0
 *  Copyright 19.06.2016 by Jigyasa Grover, @jig08
 *
 *  This library is free software; you can redistribute it and/or
 *  modify it under the terms of the GNU Lesser General Public
 *  License as published by the Free Software Foundation; either
 *  version 2.1 of the License, or (at your option) any later version.
 *  
 *  This library is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 *  Lesser General Public License for more details.
 *  
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program in the file lgpl21.txt
 *  If not, see http://www.gnu.org/licenses/.
 */

package org.loklak.api.search;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.json.JSONArray;
import org.json.JSONObject;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.loklak.http.RemoteAccess;
import org.loklak.server.Query;

public class EventbriteCrawler extends HttpServlet {

    private static final long serialVersionUID = 5216519528576842483L;

    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        doGet(request, response);
    }

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        Query post = RemoteAccess.evaluate(request);

        // manage DoS
        if (post.isDoS_blackout()) {
            response.sendError(503, "your request frequency is too high");
            return;
        }

        String url = post.get("url", "");

        Document htmlPage = null;

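        try {
            htmlPage = Jsoup.connect(url).get();
        } catch (Exception e) {
            // Without the page there is nothing to scrape; report the failure
            // instead of running into a NullPointerException further down
            e.printStackTrace();
            response.sendError(502, "could not fetch the given url");
            return;
        }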

        String eventID = null;
        String eventName = null;
        String eventDescription = null;

        // TODO Fetch Event Color
        String eventColor = null;

        String imageLink = null;

        String eventLocation = null;

        String startingTime = null;
        String endingTime = null;

        String ticketURL = null;

        Elements tagSection = null;
        Elements tagSpan = null;
        String[][] tags = new String[5][2];
        String topic = null; // By default

        String closingDateTime = null;
        String schedulePublishedOn = null;
        JSONObject creator = new JSONObject();
        String email = null;

        Float latitude = null;
        Float longitude = null;

        String privacy = "public"; // By Default
        String state = "completed"; // By Default
        String eventType = "";

        eventID = htmlPage.getElementsByTag("body").attr("data-event-id");
        eventName = htmlPage.getElementsByClass("listing-hero-body").text();
        eventDescription = htmlPage.select("div.js-xd-read-more-toggle-view.read-more__toggle-view").text();

        eventColor = null;

        imageLink = htmlPage.getElementsByTag("picture").attr("content");

        eventLocation = htmlPage.select("p.listing-map-card-street-address.text-default").text();
        startingTime = htmlPage.getElementsByAttributeValue("property", "event:start_time").attr("content").substring(0,
                19);
        endingTime = htmlPage.getElementsByAttributeValue("property", "event:end_time").attr("content").substring(0,
                19);

        ticketURL = url + "#tickets";

        // TODO Tags to be modified to fit in the format of Open Event "topic"
        tagSection = htmlPage.getElementsByAttributeValue("data-automation", "ListingsBreadcrumbs");
        tagSpan = tagSection.select("span");
        topic = "";

        int iterator = 0, k = 0;
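        for (Element e : tagSpan) {
            // tags has room for five entries only; ignore any further breadcrumb spans
            if (k >= tags.length) {
                break;
            }
            if (iterator % 2 == 0) {
                tags[k][1] = "www.eventbrite.com"
                        + e.select("a.js-d-track-link.badge.badge--tag.l-mar-top-2").attr("href");
            } else {
                tags[k][0] = e.text();
                k++;
            }
            iterator++;
        }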

        creator.put("email", "");
        creator.put("id", "1"); // By Default

        latitude = Float
                .valueOf(htmlPage.getElementsByAttributeValue("property", "event:location:latitude").attr("content"));
        longitude = Float
                .valueOf(htmlPage.getElementsByAttributeValue("property", "event:location:longitude").attr("content"));

        // TODO This returns: "events.event" which is not supported by Open
        // Event Generator
        // eventType = htmlPage.getElementsByAttributeValue("property",
        // "og:type").attr("content");

        String organizerName = null;
        String organizerLink = null;
        String organizerProfileLink = null;
        String organizerWebsite = null;
        String organizerContactInfo = null;
        String organizerDescription = null;
        String organizerFacebookFeedLink = null;
        String organizerTwitterFeedLink = null;
        String organizerFacebookAccountLink = null;
        String organizerTwitterAccountLink = null;

        organizerName = htmlPage.select("a.js-d-scroll-to.listing-organizer-name.text-default").text().substring(4);
        organizerLink = url + "#listing-organizer";
        organizerProfileLink = htmlPage
                .getElementsByAttributeValue("class", "js-follow js-follow-target follow-me fx--fade-in is-hidden")
                .attr("href");
        organizerContactInfo = url + "#lightbox_contact";

        Document orgProfilePage = null;

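        try {
            orgProfilePage = Jsoup.connect(organizerProfileLink).get();
        } catch (Exception e) {
            e.printStackTrace();
            // Fall back to an empty document so the organizer fields below
            // come out as empty strings instead of throwing a NullPointerException
            orgProfilePage = Jsoup.parse("");
        }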

        organizerWebsite = orgProfilePage.getElementsByAttributeValue("class", "l-pad-vert-1 organizer-website").text();
        organizerDescription = orgProfilePage.select("div.js-long-text.organizer-description").text();
        organizerFacebookFeedLink = organizerProfileLink + "#facebook_feed";
        organizerTwitterFeedLink = organizerProfileLink + "#twitter_feed";
        organizerFacebookAccountLink = orgProfilePage.getElementsByAttributeValue("class", "fb-page").attr("data-href");
        organizerTwitterAccountLink = orgProfilePage.getElementsByAttributeValue("class", "twitter-timeline")
                .attr("href");

        JSONArray socialLinks = new JSONArray();

        JSONObject fb = new JSONObject();
        fb.put("id", "1");
        fb.put("name", "Facebook");
        fb.put("link", organizerFacebookAccountLink);
        socialLinks.put(fb);

        JSONObject tw = new JSONObject();
        tw.put("id", "2");
        tw.put("name", "Twitter");
        tw.put("link", organizerTwitterAccountLink);
        socialLinks.put(tw);

        JSONArray jsonArray = new JSONArray();

        JSONObject event = new JSONObject();
        event.put("event_url", url);
        event.put("id", eventID);
        event.put("name", eventName);
        event.put("description", eventDescription);
        event.put("color", eventColor);
        event.put("background_url", imageLink);
        event.put("closing_datetime", closingDateTime);
        event.put("creator", creator);
        event.put("email", email);
        event.put("location_name", eventLocation);
        event.put("latitude", latitude);
        event.put("longitude", longitude);
        event.put("start_time", startingTime);
        event.put("end_time", endingTime);
        event.put("logo", imageLink);
        event.put("organizer_description", organizerDescription);
        event.put("organizer_name", organizerName);
        event.put("privacy", privacy);
        event.put("schedule_published_on", schedulePublishedOn);
        event.put("state", state);
        event.put("type", eventType);
        event.put("ticket_url", ticketURL);
        event.put("social_links", socialLinks);
        event.put("topic", topic);
        jsonArray.put(event);

        JSONObject org = new JSONObject();
        org.put("organizer_name", organizerName);
        org.put("organizer_link", organizerLink);
        org.put("organizer_profile_link", organizerProfileLink);
        org.put("organizer_website", organizerWebsite);
        org.put("organizer_contact_info", organizerContactInfo);
        org.put("organizer_description", organizerDescription);
        org.put("organizer_facebook_feed_link", organizerFacebookFeedLink);
        org.put("organizer_twitter_feed_link", organizerTwitterFeedLink);
        org.put("organizer_facebook_account_link", organizerFacebookAccountLink);
        org.put("organizer_twitter_account_link", organizerTwitterAccountLink);
        jsonArray.put(org);

        JSONArray microlocations = new JSONArray();
        jsonArray.put(microlocations);

        JSONArray customForms = new JSONArray();
        jsonArray.put(customForms);

        JSONArray sessionTypes = new JSONArray();
        jsonArray.put(sessionTypes);

        JSONArray sessions = new JSONArray();
        jsonArray.put(sessions);

        JSONArray sponsors = new JSONArray();
        jsonArray.put(sponsors);

        JSONArray speakers = new JSONArray();
        jsonArray.put(speakers);

        JSONArray tracks = new JSONArray();
        jsonArray.put(tracks);

        JSONObject eventBriteResult = new JSONObject();
        eventBriteResult.put("Event Brite Event Details", jsonArray);

        // print JSON
        response.setCharacterEncoding("UTF-8");
        PrintWriter sos = response.getWriter();
        sos.print(eventBriteResult.toString(2));
        sos.println();

        String userHome = System.getProperty("user.home");
        String path = userHome + "/Downloads/EventBriteInfo";

        new File(path).mkdir();

        try (FileWriter file = new FileWriter(path + "/event.json")) {
            file.write(event.toString());
        } catch (IOException e1) {
            e1.printStackTrace();
        }

        try (FileWriter file = new FileWriter(path + "/org.json")) {
            file.write(org.toString());
        } catch (IOException e1) {
            e1.printStackTrace();
        }

        try (FileWriter file = new FileWriter(path + "/social_links.json")) {
            file.write(socialLinks.toString());
        } catch (IOException e1) {
            e1.printStackTrace();
        }

        try (FileWriter file = new FileWriter(path + "/microlocations.json")) {
            file.write(microlocations.toString());
        } catch (IOException e1) {
            e1.printStackTrace();
        }

        try (FileWriter file = new FileWriter(path + "/custom_forms.json")) {
            file.write(customForms.toString());
        } catch (IOException e1) {
            e1.printStackTrace();
        }

        try (FileWriter file = new FileWriter(path + "/session_types.json")) {
            file.write(sessionTypes.toString());
        } catch (IOException e1) {
            e1.printStackTrace();
        }

        try (FileWriter file = new FileWriter(path + "/sessions.json")) {
            file.write(sessions.toString());
        } catch (IOException e1) {
            e1.printStackTrace();
        }

        try (FileWriter file = new FileWriter(path + "/sponsors.json")) {
            file.write(sponsors.toString());
        } catch (IOException e1) {
            e1.printStackTrace();
        }

        try (FileWriter file = new FileWriter(path + "/speakers.json")) {
            file.write(speakers.toString());
        } catch (IOException e1) {
            e1.printStackTrace();
        }

        try (FileWriter file = new FileWriter(path + "/tracks.json")) {
            file.write(tracks.toString());
        } catch (IOException e1) {
            e1.printStackTrace();
        }

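        try {
            // The destination must be a file, not the existing Downloads directory,
            // otherwise FileOutputStream fails with a FileNotFoundException
            zipFolder(path, path + ".zip");
        } catch (Exception e1) {
            e1.printStackTrace();
        }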

    }

    static public void zipFolder(String srcFolder, String destZipFile) throws Exception {
        // try-with-resources closes both streams even if adding an entry fails
        try (FileOutputStream fileWriter = new FileOutputStream(destZipFile);
                ZipOutputStream zip = new ZipOutputStream(fileWriter)) {
            addFolderToZip("", srcFolder, zip);
            zip.flush();
        }
    }

    static private void addFileToZip(String path, String srcFile, ZipOutputStream zip) throws Exception {
        File folder = new File(srcFile);
        if (folder.isDirectory()) {
            addFolderToZip(path, srcFile, zip);
        } else {
            byte[] buf = new byte[1024];
            int len;
            // try-with-resources closes the input stream even if writing to the zip fails
            try (FileInputStream in = new FileInputStream(srcFile)) {
                zip.putNextEntry(new ZipEntry(path + "/" + folder.getName()));
                while ((len = in.read(buf)) > 0) {
                    zip.write(buf, 0, len);
                }
            }
        }
    }

    static private void addFolderToZip(String path, String srcFolder, ZipOutputStream zip) throws Exception {
        File folder = new File(srcFolder);

        for (String fileName : folder.list()) {
            if (path.equals("")) {
                addFileToZip(folder.getName(), srcFolder + "/" + fileName, zip);
            } else {
                addFileToZip(path + "/" + folder.getName(), srcFolder + "/" + fileName, zip);
            }
        }
    }

}
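As a side note on the zipping step in the listing above, here is a minimal sketch of the same idea using java.nio and try-with-resources. It assumes the archive should be written to a regular file outside the source folder; FolderZipper and its zipFolder method are hypothetical names chosen for illustration and are not part of loklak_server.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class FolderZipper {

    // Hypothetical helper: zips every regular file under srcFolder into destZipFile
    // and returns the entry names that were written.
    public static List<String> zipFolder(Path srcFolder, Path destZipFile) throws IOException {
        List<Path> files;
        // Collect the files first so the walk is finished before the archive is created
        try (Stream<Path> walk = Files.walk(srcFolder)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<String> entryNames = new ArrayList<>();
        try (OutputStream out = Files.newOutputStream(destZipFile);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                // Entry names start with the folder name and always use '/' as separator
                String name = srcFolder.getFileName() + "/"
                        + srcFolder.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(file, zip);
                zip.closeEntry();
                entryNames.add(name);
            }
        }
        return entryNames;
    }

    public static void main(String[] args) throws IOException {
        // Demo in a temporary directory so nothing in the user's home is touched
        Path dir = Files.createTempDirectory("EventBriteInfo");
        Files.write(dir.resolve("event.json"), "{}".getBytes(StandardCharsets.UTF_8));
        Path zipFile = dir.resolveSibling(dir.getFileName() + ".zip");
        System.out.println(zipFolder(dir, zipFile) + " -> " + zipFile);
    }
}
```

Keeping the archive outside the folder being zipped avoids the archive picking itself up as an entry on a later run.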
Check out https://github.com/loklak/loklak_server for more...
Feel free to ask questions regarding the above code snippet. Also, stay tuned for the next part of this post, which will cover using the scraped information for Open Event. Feedback and suggestions welcome :)

GSoC -Breath and Review

Well, my GSoC started more than a month ago, and now it's time to take a breath and review my work.

A couple weeks ago I went to Randa Meetings, a sprint of KDE, and there I did a lot of work in Umbrello. 301 more words

Gsoc

GSOC update 5: Refactoring to Minimize Duplication

Code refactoring refers to restructuring existing code while ensuring that its behaviour and functionality are preserved. And as to reasons why one would do it – 310 more words

Open Source

Mid-term eval - GSoC 2016

Hi!

Today I will summarize everything that I have done in my Google Summer of Code since it began on May 23rd. 441 more words

Kde

Now get WordPress blog updates with Loklak !

Loklak shall soon be spoiling its users !

Next, it will be bringing in tiny tweet-like cards showing the blog-posts (title, publishing date, author and content) from the given WordPress Blog URL. 725 more words


NOW GET WORDPRESS BLOG UPDATES WITH LOKLAK !

Loklak shall soon be spoiling its users !

Next, it will be bringing in tiny tweet-like cards showing the blog posts (title, publishing date, author and content) from a given WordPress blog URL. This feature is certain to further Loklak's mission of building a comprehensive and extensive social network dispensing useful information.

To implement this feature, I have again made use of JSoup: The Java HTML parser library, as it provides a very convenient API for fetching HTML from a URL, parsing it, and extracting and manipulating the data. The corresponding URL, in the format "https://[username].wordpress.com/", is passed as an argument to the function scrapeWordpress(String blogURL){..}, which scrapes the page with JSoup and returns a JSONObject as the result.

A look at the code snippet:
/**
 *  Wordpress Blog Scraper
 *  By Jigyasa Grover, @jig08
 **/

package org.loklak.harvester;

import java.io.IOException;

import org.json.JSONArray;
import org.json.JSONObject;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WordpressBlogScraper {
    public static void main(String args[]){
        
        String blogURL = "https://loklaknet.wordpress.com/";
        scrapeWordpress(blogURL);       
    }
    
    public static JSONObject scrapeWordpress(String blogURL) {
        
                Document blogHTML = null;
        
        Elements articles = null;
        Elements articleList_title = null;
        Elements articleList_content = null;
        Elements articleList_dateTime = null;
        Elements articleList_author = null;

        String[][] blogPosts = new String[100][4];
        
        //blogPosts[][0] = Blog Title
        //blogPosts[][1] = Posted On
        //blogPosts[][2] = Author
        //blogPosts[][3] = Blog Content
        
        Integer numberOfBlogs = 0;
        Integer iterator = 0;
        
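        try{            
            blogHTML = Jsoup.connect(blogURL).get();
        }catch (IOException e) {
            e.printStackTrace();
            // Nothing could be fetched, so return an empty result rather than
            // dereferencing a null document below
            return new JSONObject();
        }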
            
            articles = blogHTML.getElementsByTag("article");
            
            iterator = 0;
            for(Element article : articles){
                
                articleList_title = article.getElementsByClass("entry-title");              
                for(Element blogs : articleList_title){
                    blogPosts[iterator][0] = blogs.text().toString();
                }
                
                articleList_dateTime = article.getElementsByClass("posted-on");             
                for(Element blogs : articleList_dateTime){
                    blogPosts[iterator][1] = blogs.text().toString();
                }
                
                articleList_author = article.getElementsByClass("byline");              
                for(Element blogs : articleList_author){
                    blogPosts[iterator][2] = blogs.text().toString();
                }
                
                articleList_content = article.getElementsByClass("entry-content");              
                for(Element blogs : articleList_content){
                    blogPosts[iterator][3] = blogs.text().toString();
                }
                
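                iterator++;
                // blogPosts has a fixed capacity; stop before overflowing it
                if (iterator >= blogPosts.length) {
                    break;
                }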
                
            }
            
            numberOfBlogs = iterator;
            
            JSONArray blog = new JSONArray();
            
            for(int k = 0; k<numberOfBlogs; k++){
                JSONObject blogpost = new JSONObject();
                blogpost.put("blog_url", blogURL);
                blogpost.put("title", blogPosts[k][0]);
                blogpost.put("posted_on", blogPosts[k][1]);
                blogpost.put("author", blogPosts[k][2]);
                blogpost.put("content", blogPosts[k][3]);
                blog.put(blogpost);
            }           
            
            JSONObject final_blog_info = new JSONObject();
            
            final_blog_info.put("Wordpress blog: " + blogURL, blog);            

            System.out.println(final_blog_info);
            
            return final_blog_info;
        
    }
}
Here, an HTTP connection was simply established and the text extracted using "element_name".text() from inside specific tags, located by identifiers such as classes or ids. The tags from which the information was to be extracted were identified by exploring the web page’s HTML source code. The result thus obtained is a JSON object:
{
  "Wordpress blog: https://loklaknet.wordpress.com/": [
    {
      "posted_on": "June 19, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "shivenmian",
      "title": "loklak_depot \u2013 The Beginning: Accounts (Part 3)",
      "content": "So this is my third post in this five part series on loklak_depo... As always, feedback is duly welcome."
    },
    {
      "posted_on": "June 19, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "sopankhosla",
      "title": "Creating a Loklak App!",
      "content": "Hello everyone! Today I will be shifting from course a...ore info refer to the full documentation here. Happy Coding!!!"
    },
    {
      "posted_on": "June 17, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "leonmakk",
      "title": "Loklak Walls Manual Moderation \u2013 tweet storage",
      "content": "Loklak walls are going to....Stay tuned for more updates on this new feature of loklak walls!"
    },
    {
      "posted_on": "June 17, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "Robert",
      "title": "Under the hood: Authentication (login)",
      "content": "In the second post of .....key login is ready."
    },
    {
      "posted_on": "June 17, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "jigyasa",
      "title": "Loklak gives some hackernews now !",
      "content": "It's been befittingly said  \u... Also, Stay tuned for more posts on data crawling and parsing for Loklak. Feedback and Suggestions welcome"
    },
    {
      "posted_on": "June 16, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "Damini",
      "title": "Does tweets have emotions?",
      "content": "Tweets do intend some kind o...t of features: classify(feat1,\u2026,featN) = argmax(P(cat)*PROD(P(featI|cat)"
    },
    {
      "posted_on": "June 15, 2016",
      "blog_url": "https://loklaknet.wordpress.com/",
      "author": "sudheesh001",
      "title": "Dockerize the loklak server and publish docker images to IBM Containers on Bluemix Cloud",
      "content": "Docker is an open source...nd to create and deploy instantly as well as scale on demand."
    }
  ]
}
The next step would be "writeToBackend"-ing and then parsing the JSONObject as desired. Feel free to ask questions regarding the above code snippet; I shall be happy to assist. Feedback and suggestions welcome :)

GSOC update 4: Testing & Refactoring

This week saw me continue the work of Selenium testing which I started off towards the end of last week. This basically involved thinking of possible scenarios that the user might encounter and how the system should respond to them. 293 more words

Open Source